Banque Misr - 2nd Interview "Use Case"
By: Hala Sedki
This case study demonstrates how to apply Exploratory Data Analysis (EDA) in a real-world business context. It not only employs EDA techniques but also provides insights into risk analytics within banking and financial services, illustrating how data can be leveraged to reduce the risk of financial loss when lending to customers.
Loan providers often struggle to approve loans due to insufficient or non-existent credit histories of applicants. This issue can be exploited by some consumers who may default on their loans. In a consumer finance company specializing in urban loans, it is crucial to use Exploratory Data Analysis (EDA) to identify patterns and ensure that applicants who can repay their loans are not unjustly rejected.
When assessing a loan application, the company faces two main risks:
- If the applicant is likely to repay but the loan is not approved, the company loses business.
- If the applicant is likely to default but the loan is approved, the company incurs a financial loss.
The dataset includes information about loan applications, categorized into two scenarios:
- Clients with payment difficulties: late payment on at least one of the early installments (TARGET = 1).
- All other cases: payments made on time (TARGET = 0).
Decisions made during the loan process fall into four categories:
- Approved: the company approved the loan application.
- Cancelled: the client cancelled the application during the process.
- Refused: the company rejected the loan application.
- Unused offer: the loan was approved but the client did not use it.
This case study will utilize EDA to explore how both consumer and loan attributes impact the likelihood of default.
This case study seeks to uncover patterns that signal potential difficulties in repaying loan installments. By identifying these patterns, the company can make informed decisions such as rejecting high-risk loans, adjusting loan amounts, or charging higher interest rates for riskier applicants. This approach helps ensure that only those capable of repaying are approved.
In essence, the goal is to pinpoint the key factors or variables that strongly indicate the likelihood of loan default. Understanding these driver variables will enable the company to better assess and manage risk within its loan portfolio.
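Once the data is clean, one simple way to shortlist candidate driver variables is to rank numeric features by the absolute value of their correlation with TARGET. The sketch below uses a tiny invented frame standing in for the cleaned application data; it illustrates the ranking idea only, not the notebook's actual method:

```python
import pandas as pd

# Toy stand-in for the cleaned application data (invented values).
df = pd.DataFrame({
    'TARGET': [1, 0, 0, 1, 0, 1],
    'AMT_INCOME_TOTAL': [100, 300, 280, 120, 260, 90],
    'CNT_CHILDREN': [0, 1, 0, 2, 1, 0],
})

# Absolute correlation with the default flag, strongest first.
corr = (df.corr()['TARGET']
          .drop('TARGET')
          .abs()
          .sort_values(ascending=False))
print(corr)
```

Correlation only captures linear, univariate signal, so a ranking like this complements, rather than replaces, segmented univariate and bivariate analysis.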
We will start by reviewing the CSV file containing the current loan applications from customers. After analyzing this data, we will examine past application details for additional insights in a later phase.
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import itertools
%matplotlib inline
# Filter out the warnings
import warnings
warnings.filterwarnings('ignore')
# Setting maximum rows and columns display size to 200 for better visibility of data
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
# Read the application data file
application_df = pd.read_csv('application_data.csv')
application_df.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
# Check the number of rows and columns in the dataframe
application_df.shape
(307511, 122)
# Check the column-wise info of the dataframe
application_df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Data columns (total 122 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 TARGET int64 2 NAME_CONTRACT_TYPE object 3 CODE_GENDER object 4 FLAG_OWN_CAR object 5 FLAG_OWN_REALTY object 6 CNT_CHILDREN int64 7 AMT_INCOME_TOTAL float64 8 AMT_CREDIT float64 9 AMT_ANNUITY float64 10 AMT_GOODS_PRICE float64 11 NAME_TYPE_SUITE object 12 NAME_INCOME_TYPE object 13 NAME_EDUCATION_TYPE object 14 NAME_FAMILY_STATUS object 15 NAME_HOUSING_TYPE object 16 REGION_POPULATION_RELATIVE float64 17 DAYS_BIRTH int64 18 DAYS_EMPLOYED int64 19 DAYS_REGISTRATION float64 20 DAYS_ID_PUBLISH int64 21 OWN_CAR_AGE float64 22 FLAG_MOBIL int64 23 FLAG_EMP_PHONE int64 24 FLAG_WORK_PHONE int64 25 FLAG_CONT_MOBILE int64 26 FLAG_PHONE int64 27 FLAG_EMAIL int64 28 OCCUPATION_TYPE object 29 CNT_FAM_MEMBERS float64 30 REGION_RATING_CLIENT int64 31 REGION_RATING_CLIENT_W_CITY int64 32 WEEKDAY_APPR_PROCESS_START object 33 HOUR_APPR_PROCESS_START int64 34 REG_REGION_NOT_LIVE_REGION int64 35 REG_REGION_NOT_WORK_REGION int64 36 LIVE_REGION_NOT_WORK_REGION int64 37 REG_CITY_NOT_LIVE_CITY int64 38 REG_CITY_NOT_WORK_CITY int64 39 LIVE_CITY_NOT_WORK_CITY int64 40 ORGANIZATION_TYPE object 41 EXT_SOURCE_1 float64 42 EXT_SOURCE_2 float64 43 EXT_SOURCE_3 float64 44 APARTMENTS_AVG float64 45 BASEMENTAREA_AVG float64 46 YEARS_BEGINEXPLUATATION_AVG float64 47 YEARS_BUILD_AVG float64 48 COMMONAREA_AVG float64 49 ELEVATORS_AVG float64 50 ENTRANCES_AVG float64 51 FLOORSMAX_AVG float64 52 FLOORSMIN_AVG float64 53 LANDAREA_AVG float64 54 LIVINGAPARTMENTS_AVG float64 55 LIVINGAREA_AVG float64 56 NONLIVINGAPARTMENTS_AVG float64 57 NONLIVINGAREA_AVG float64 58 APARTMENTS_MODE float64 59 BASEMENTAREA_MODE float64 60 YEARS_BEGINEXPLUATATION_MODE float64 61 YEARS_BUILD_MODE float64 62 COMMONAREA_MODE float64 63 ELEVATORS_MODE float64 64 ENTRANCES_MODE float64 65 FLOORSMAX_MODE float64 66 FLOORSMIN_MODE float64 67 LANDAREA_MODE float64 68 
LIVINGAPARTMENTS_MODE float64 69 LIVINGAREA_MODE float64 70 NONLIVINGAPARTMENTS_MODE float64 71 NONLIVINGAREA_MODE float64 72 APARTMENTS_MEDI float64 73 BASEMENTAREA_MEDI float64 74 YEARS_BEGINEXPLUATATION_MEDI float64 75 YEARS_BUILD_MEDI float64 76 COMMONAREA_MEDI float64 77 ELEVATORS_MEDI float64 78 ENTRANCES_MEDI float64 79 FLOORSMAX_MEDI float64 80 FLOORSMIN_MEDI float64 81 LANDAREA_MEDI float64 82 LIVINGAPARTMENTS_MEDI float64 83 LIVINGAREA_MEDI float64 84 NONLIVINGAPARTMENTS_MEDI float64 85 NONLIVINGAREA_MEDI float64 86 FONDKAPREMONT_MODE object 87 HOUSETYPE_MODE object 88 TOTALAREA_MODE float64 89 WALLSMATERIAL_MODE object 90 EMERGENCYSTATE_MODE object 91 OBS_30_CNT_SOCIAL_CIRCLE float64 92 DEF_30_CNT_SOCIAL_CIRCLE float64 93 OBS_60_CNT_SOCIAL_CIRCLE float64 94 DEF_60_CNT_SOCIAL_CIRCLE float64 95 DAYS_LAST_PHONE_CHANGE float64 96 FLAG_DOCUMENT_2 int64 97 FLAG_DOCUMENT_3 int64 98 FLAG_DOCUMENT_4 int64 99 FLAG_DOCUMENT_5 int64 100 FLAG_DOCUMENT_6 int64 101 FLAG_DOCUMENT_7 int64 102 FLAG_DOCUMENT_8 int64 103 FLAG_DOCUMENT_9 int64 104 FLAG_DOCUMENT_10 int64 105 FLAG_DOCUMENT_11 int64 106 FLAG_DOCUMENT_12 int64 107 FLAG_DOCUMENT_13 int64 108 FLAG_DOCUMENT_14 int64 109 FLAG_DOCUMENT_15 int64 110 FLAG_DOCUMENT_16 int64 111 FLAG_DOCUMENT_17 int64 112 FLAG_DOCUMENT_18 int64 113 FLAG_DOCUMENT_19 int64 114 FLAG_DOCUMENT_20 int64 115 FLAG_DOCUMENT_21 int64 116 AMT_REQ_CREDIT_BUREAU_HOUR float64 117 AMT_REQ_CREDIT_BUREAU_DAY float64 118 AMT_REQ_CREDIT_BUREAU_WEEK float64 119 AMT_REQ_CREDIT_BUREAU_MON float64 120 AMT_REQ_CREDIT_BUREAU_QRT float64 121 AMT_REQ_CREDIT_BUREAU_YEAR float64 dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
# Check the summary for the numeric columns
application_df.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 104582.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307509.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 134133.000000 | 3.068510e+05 | 246546.000000 | 151450.00000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 159080.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 307510.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | -4986.120328 | -2994.202373 | 12.061091 | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | 2.152665 | 2.052463 | 2.031521 | 12.063419 | 0.015144 | 0.050769 | 0.040659 | 0.078173 | 0.230454 | 0.179555 | 0.502130 | 5.143927e-01 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | 0.114231 | 0.087543 | 0.977065 | 0.759637 | 0.042553 | 0.074490 | 0.145193 | 0.222315 | 0.228058 | 0.064958 | 0.105645 | 0.105975 | 0.008076 | 0.027022 | 0.117850 | 0.087955 | 0.977752 | 0.755746 | 0.044595 | 0.078078 | 0.149213 | 0.225897 | 0.231625 | 0.067169 | 0.101954 | 0.108607 | 0.008651 | 0.028236 | 0.102547 | 1.422245 | 0.143421 | 1.405292 | 0.100049 | -962.858788 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | 0.088055 | 0.000192 | 0.081376 | 0.003896 | 0.000023 | 0.003912 | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 11.944812 | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | 0.910682 | 0.509034 | 0.502737 | 3.265832 | 0.122126 | 0.219526 | 0.197499 | 0.268444 | 0.421124 | 0.383817 | 0.211062 | 1.910602e-01 | 0.194844 | 0.10824 | 0.082438 | 0.059223 | 0.113280 | 0.076036 | 0.134576 | 0.100049 | 0.144641 | 0.161380 | 0.081184 | 0.092576 | 0.110565 | 0.047732 | 0.069523 | 0.107936 | 0.084307 | 0.064575 | 0.110111 | 0.074445 | 0.132256 | 0.100977 | 0.143709 | 0.161160 | 0.081750 | 0.097880 | 0.111845 | 0.046276 | 0.070254 | 0.109076 | 0.082179 | 0.059897 | 0.112066 | 0.076144 | 0.134467 | 0.100368 | 0.145067 | 0.161934 | 0.082167 | 0.093642 | 0.112260 | 0.047415 | 0.070166 | 0.107462 | 2.400989 | 0.446698 | 2.379803 | 0.362291 | 826.808487 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | 0.283376 | 0.013850 | 0.273412 | 0.062295 | 0.004771 | 0.062424 | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014568 | 8.173617e-08 | 0.000527 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | 5.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.334007 | 3.924574e-01 | 0.370650 | 0.05770 | 0.044200 | 0.976700 | 0.687200 | 0.007800 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.050400 | 0.045300 | 0.000000 | 0.000000 | 0.052500 | 0.040700 | 0.976700 | 0.699400 | 0.007200 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.016600 | 0.054200 | 0.042700 | 0.000000 | 0.000000 | 0.058300 | 0.043700 | 0.976700 | 0.691400 | 0.007900 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.051300 | 0.045700 | 0.000000 | 0.000000 | 0.041200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.505998 | 5.659614e-01 | 0.535276 | 0.08760 | 0.076300 | 0.981600 | 0.755200 | 0.021100 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048100 | 0.075600 | 0.074500 | 0.000000 | 0.003600 | 0.084000 | 0.074600 | 0.981600 | 0.764800 | 0.019000 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.045800 | 0.077100 | 0.073100 | 0.000000 | 0.001100 | 0.086400 | 0.075800 | 0.981600 | 0.758500 | 0.020800 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048700 | 0.076100 | 0.074900 | 0.000000 | 0.003100 | 0.068800 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.675053 | 6.636171e-01 | 0.669057 | 0.14850 | 0.112200 | 0.986600 | 0.823200 | 0.051500 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.085600 | 0.121000 | 0.129900 | 0.003900 | 0.027700 | 0.143900 | 0.112400 | 0.986600 | 0.823600 | 0.049000 | 0.120800 | 0.206900 | 0.333300 | 0.375000 | 0.084100 | 0.131300 | 0.125200 | 0.003900 | 0.023100 | 0.148900 | 0.111600 | 0.986600 | 0.825600 | 0.051300 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.123100 | 0.130300 | 0.003900 | 0.026600 | 0.127600 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 91.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 20.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.962693 | 8.549997e-01 | 0.896010 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
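One value in the summary above stands out: the maximum of DAYS_EMPLOYED is 365243 (roughly 1000 years), which is almost certainly a sentinel for pensioners or unemployed applicants rather than a real tenure. A minimal sketch of how it could be flagged and neutralized, on a toy frame standing in for application_df:

```python
import numpy as np
import pandas as pd

# Toy stand-in: negative day counts plus the suspected sentinel value.
df = pd.DataFrame({'DAYS_EMPLOYED': [-637, -1188, 365243, -3039, 365243]})

SENTINEL = 365243
share = (df['DAYS_EMPLOYED'] == SENTINEL).mean()
print(f"{share:.0%} of rows carry the sentinel value")

# Replace the sentinel with NaN so it no longer distorts means and plots.
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].replace(SENTINEL, np.nan)
```

Whether to treat these rows as missing or as a separate "not employed" segment is a judgment call that depends on the analysis.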
In this section we will perform a data quality check by identifying missing values, incorrect data types, etc., and suggest the best possible way to treat such data.
# Check for missing values in percentage
round(100 * application_df.isnull().mean(),2)
SK_ID_CURR 0.00 TARGET 0.00 NAME_CONTRACT_TYPE 0.00 CODE_GENDER 0.00 FLAG_OWN_CAR 0.00 FLAG_OWN_REALTY 0.00 CNT_CHILDREN 0.00 AMT_INCOME_TOTAL 0.00 AMT_CREDIT 0.00 AMT_ANNUITY 0.00 AMT_GOODS_PRICE 0.09 NAME_TYPE_SUITE 0.42 NAME_INCOME_TYPE 0.00 NAME_EDUCATION_TYPE 0.00 NAME_FAMILY_STATUS 0.00 NAME_HOUSING_TYPE 0.00 REGION_POPULATION_RELATIVE 0.00 DAYS_BIRTH 0.00 DAYS_EMPLOYED 0.00 DAYS_REGISTRATION 0.00 DAYS_ID_PUBLISH 0.00 OWN_CAR_AGE 65.99 FLAG_MOBIL 0.00 FLAG_EMP_PHONE 0.00 FLAG_WORK_PHONE 0.00 FLAG_CONT_MOBILE 0.00 FLAG_PHONE 0.00 FLAG_EMAIL 0.00 OCCUPATION_TYPE 31.35 CNT_FAM_MEMBERS 0.00 REGION_RATING_CLIENT 0.00 REGION_RATING_CLIENT_W_CITY 0.00 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 REG_REGION_NOT_LIVE_REGION 0.00 REG_REGION_NOT_WORK_REGION 0.00 LIVE_REGION_NOT_WORK_REGION 0.00 REG_CITY_NOT_LIVE_CITY 0.00 REG_CITY_NOT_WORK_CITY 0.00 LIVE_CITY_NOT_WORK_CITY 0.00 ORGANIZATION_TYPE 0.00 EXT_SOURCE_1 56.38 EXT_SOURCE_2 0.21 EXT_SOURCE_3 19.83 APARTMENTS_AVG 50.75 BASEMENTAREA_AVG 58.52 YEARS_BEGINEXPLUATATION_AVG 48.78 YEARS_BUILD_AVG 66.50 COMMONAREA_AVG 69.87 ELEVATORS_AVG 53.30 ENTRANCES_AVG 50.35 FLOORSMAX_AVG 49.76 FLOORSMIN_AVG 67.85 LANDAREA_AVG 59.38 LIVINGAPARTMENTS_AVG 68.35 LIVINGAREA_AVG 50.19 NONLIVINGAPARTMENTS_AVG 69.43 NONLIVINGAREA_AVG 55.18 APARTMENTS_MODE 50.75 BASEMENTAREA_MODE 58.52 YEARS_BEGINEXPLUATATION_MODE 48.78 YEARS_BUILD_MODE 66.50 COMMONAREA_MODE 69.87 ELEVATORS_MODE 53.30 ENTRANCES_MODE 50.35 FLOORSMAX_MODE 49.76 FLOORSMIN_MODE 67.85 LANDAREA_MODE 59.38 LIVINGAPARTMENTS_MODE 68.35 LIVINGAREA_MODE 50.19 NONLIVINGAPARTMENTS_MODE 69.43 NONLIVINGAREA_MODE 55.18 APARTMENTS_MEDI 50.75 BASEMENTAREA_MEDI 58.52 YEARS_BEGINEXPLUATATION_MEDI 48.78 YEARS_BUILD_MEDI 66.50 COMMONAREA_MEDI 69.87 ELEVATORS_MEDI 53.30 ENTRANCES_MEDI 50.35 FLOORSMAX_MEDI 49.76 FLOORSMIN_MEDI 67.85 LANDAREA_MEDI 59.38 LIVINGAPARTMENTS_MEDI 68.35 LIVINGAREA_MEDI 50.19 NONLIVINGAPARTMENTS_MEDI 69.43 NONLIVINGAREA_MEDI 55.18 FONDKAPREMONT_MODE 68.39 
HOUSETYPE_MODE 50.18 TOTALAREA_MODE 48.27 WALLSMATERIAL_MODE 50.84 EMERGENCYSTATE_MODE 47.40 OBS_30_CNT_SOCIAL_CIRCLE 0.33 DEF_30_CNT_SOCIAL_CIRCLE 0.33 OBS_60_CNT_SOCIAL_CIRCLE 0.33 DEF_60_CNT_SOCIAL_CIRCLE 0.33 DAYS_LAST_PHONE_CHANGE 0.00 FLAG_DOCUMENT_2 0.00 FLAG_DOCUMENT_3 0.00 FLAG_DOCUMENT_4 0.00 FLAG_DOCUMENT_5 0.00 FLAG_DOCUMENT_6 0.00 FLAG_DOCUMENT_7 0.00 FLAG_DOCUMENT_8 0.00 FLAG_DOCUMENT_9 0.00 FLAG_DOCUMENT_10 0.00 FLAG_DOCUMENT_11 0.00 FLAG_DOCUMENT_12 0.00 FLAG_DOCUMENT_13 0.00 FLAG_DOCUMENT_14 0.00 FLAG_DOCUMENT_15 0.00 FLAG_DOCUMENT_16 0.00 FLAG_DOCUMENT_17 0.00 FLAG_DOCUMENT_18 0.00 FLAG_DOCUMENT_19 0.00 FLAG_DOCUMENT_20 0.00 FLAG_DOCUMENT_21 0.00 AMT_REQ_CREDIT_BUREAU_HOUR 13.50 AMT_REQ_CREDIT_BUREAU_DAY 13.50 AMT_REQ_CREDIT_BUREAU_WEEK 13.50 AMT_REQ_CREDIT_BUREAU_MON 13.50 AMT_REQ_CREDIT_BUREAU_QRT 13.50 AMT_REQ_CREDIT_BUREAU_YEAR 13.50 dtype: float64
# Extract the columns with more than 50% data missing and their respective missing value percentages
missing_pct = round(100 * application_df.isnull().mean(), 2)
missing50 = missing_pct[missing_pct > 50]
# Extract the column names from the above Series
cols_to_drop = list(missing50.index)
cols_to_drop
['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE']
# Remove the columns with more than 50% missing values
application_df.drop(cols_to_drop, axis = 1, inplace = True)
# Check the shape
application_df.shape
(307511, 81)
# Check for % missing values for remaining columns
round(100 * application_df.isnull().sum() / len(application_df.index),2)
SK_ID_CURR 0.00 TARGET 0.00 NAME_CONTRACT_TYPE 0.00 CODE_GENDER 0.00 FLAG_OWN_CAR 0.00 FLAG_OWN_REALTY 0.00 CNT_CHILDREN 0.00 AMT_INCOME_TOTAL 0.00 AMT_CREDIT 0.00 AMT_ANNUITY 0.00 AMT_GOODS_PRICE 0.09 NAME_TYPE_SUITE 0.42 NAME_INCOME_TYPE 0.00 NAME_EDUCATION_TYPE 0.00 NAME_FAMILY_STATUS 0.00 NAME_HOUSING_TYPE 0.00 REGION_POPULATION_RELATIVE 0.00 DAYS_BIRTH 0.00 DAYS_EMPLOYED 0.00 DAYS_REGISTRATION 0.00 DAYS_ID_PUBLISH 0.00 FLAG_MOBIL 0.00 FLAG_EMP_PHONE 0.00 FLAG_WORK_PHONE 0.00 FLAG_CONT_MOBILE 0.00 FLAG_PHONE 0.00 FLAG_EMAIL 0.00 OCCUPATION_TYPE 31.35 CNT_FAM_MEMBERS 0.00 REGION_RATING_CLIENT 0.00 REGION_RATING_CLIENT_W_CITY 0.00 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 REG_REGION_NOT_LIVE_REGION 0.00 REG_REGION_NOT_WORK_REGION 0.00 LIVE_REGION_NOT_WORK_REGION 0.00 REG_CITY_NOT_LIVE_CITY 0.00 REG_CITY_NOT_WORK_CITY 0.00 LIVE_CITY_NOT_WORK_CITY 0.00 ORGANIZATION_TYPE 0.00 EXT_SOURCE_2 0.21 EXT_SOURCE_3 19.83 YEARS_BEGINEXPLUATATION_AVG 48.78 FLOORSMAX_AVG 49.76 YEARS_BEGINEXPLUATATION_MODE 48.78 FLOORSMAX_MODE 49.76 YEARS_BEGINEXPLUATATION_MEDI 48.78 FLOORSMAX_MEDI 49.76 TOTALAREA_MODE 48.27 EMERGENCYSTATE_MODE 47.40 OBS_30_CNT_SOCIAL_CIRCLE 0.33 DEF_30_CNT_SOCIAL_CIRCLE 0.33 OBS_60_CNT_SOCIAL_CIRCLE 0.33 DEF_60_CNT_SOCIAL_CIRCLE 0.33 DAYS_LAST_PHONE_CHANGE 0.00 FLAG_DOCUMENT_2 0.00 FLAG_DOCUMENT_3 0.00 FLAG_DOCUMENT_4 0.00 FLAG_DOCUMENT_5 0.00 FLAG_DOCUMENT_6 0.00 FLAG_DOCUMENT_7 0.00 FLAG_DOCUMENT_8 0.00 FLAG_DOCUMENT_9 0.00 FLAG_DOCUMENT_10 0.00 FLAG_DOCUMENT_11 0.00 FLAG_DOCUMENT_12 0.00 FLAG_DOCUMENT_13 0.00 FLAG_DOCUMENT_14 0.00 FLAG_DOCUMENT_15 0.00 FLAG_DOCUMENT_16 0.00 FLAG_DOCUMENT_17 0.00 FLAG_DOCUMENT_18 0.00 FLAG_DOCUMENT_19 0.00 FLAG_DOCUMENT_20 0.00 FLAG_DOCUMENT_21 0.00 AMT_REQ_CREDIT_BUREAU_HOUR 13.50 AMT_REQ_CREDIT_BUREAU_DAY 13.50 AMT_REQ_CREDIT_BUREAU_WEEK 13.50 AMT_REQ_CREDIT_BUREAU_MON 13.50 AMT_REQ_CREDIT_BUREAU_QRT 13.50 AMT_REQ_CREDIT_BUREAU_YEAR 13.50 dtype: float64
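OCCUPATION_TYPE still has about 31% missing values: too many to drop row-wise, yet the column is too informative to discard. Since it is categorical, one defensible treatment is to label the gap as its own category rather than guess a real occupation; a toy sketch of that idea:

```python
import pandas as pd

# Toy stand-in for the OCCUPATION_TYPE column with gaps.
df = pd.DataFrame({'OCCUPATION_TYPE': ['Laborers', None, 'Core staff', None]})

# Keep the rows; make "missing" an explicit category of its own.
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('Unknown')
print(df['OCCUPATION_TYPE'].value_counts())
```

The 'Unknown' label is an assumption for illustration; any placeholder that cannot collide with a real occupation works.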
application_df[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']].describe()
| AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|
| count | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
There are certain columns in the dataset with incorrect data types; we can convert them to more appropriate ones.
Note: For some columns, the data type cannot be changed until the missing values are imputed. In those cases, only a suggestion is provided.
# Changing DAYS_REGISTRATION column data type to int
application_df['DAYS_REGISTRATION'] = application_df['DAYS_REGISTRATION'].astype(int)
# Check if any missing value in CNT_FAM_MEMBERS column
#application_df['CNT_FAM_MEMBERS'].isnull().sum()
# Replacing missing with median value for the CNT_FAM_MEMBERS column
#application_df['CNT_FAM_MEMBERS'].fillna(application_df['CNT_FAM_MEMBERS'].median(),inplace = True)
# Changing CNT_FAM_MEMBERS column data type to int
#application_df['CNT_FAM_MEMBERS'] = application_df['CNT_FAM_MEMBERS'].astype(int)
# We can convert these DAYS columns to int, since day counts are always whole numbers.
col_list = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']
for i in col_list:
    application_df[i] = application_df[i].astype(int)
# We have NOT converted the data type of DAYS_LAST_PHONE_CHANGE
# as it has some missing values which need to be treated (replace by median) first.
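If we want a whole-number dtype without imputing first, pandas' nullable `Int64` dtype can hold integers alongside missing values. A minimal sketch on a toy series standing in for DAYS_LAST_PHONE_CHANGE (the values here are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# A float column with a missing value, like DAYS_LAST_PHONE_CHANGE,
# cannot be cast to plain int - but the nullable 'Int64' dtype can
# represent whole numbers alongside <NA>.
s = pd.Series([-1134.0, -828.0, np.nan])
s_int = s.astype('Int64')

print(s_int.dtype)   # Int64
print(s_int.isna().sum())
```

This avoids choosing an imputation strategy just to satisfy the dtype; the median-replacement suggested above remains an option later.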
# Verify the changes
application_df[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','CNT_FAM_MEMBERS']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DAYS_BIRTH 307511 non-null int64 1 DAYS_EMPLOYED 307511 non-null int64 2 DAYS_REGISTRATION 307511 non-null int64 3 DAYS_ID_PUBLISH 307511 non-null int64 4 CNT_FAM_MEMBERS 307509 non-null float64 dtypes: float64(1), int64(4) memory usage: 11.7 MB
# Inspect the negative values in the DAYS columns
application_df[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']].describe()
| DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | DAYS_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307510.000000 |
| mean | -16036.995067 | 63815.045904 | -4986.120327 | -2994.202373 | -962.858788 |
| std | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 826.808487 |
| min | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | -4292.000000 |
| 25% | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | -1570.000000 |
| 50% | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | -757.000000 |
| 75% | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | -274.000000 |
| max | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 0.000000 |
# Make a list of all DAYS columns
col_list = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']
# Replace the values with their respective absolute values
for i in col_list:
    application_df[i] = abs(application_df[i])
# Verify the changes
application_df[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']].describe()
| DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | DAYS_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307510.000000 |
| mean | 16036.995067 | 67724.742149 | 4986.120327 | 2994.202373 | 962.858788 |
| std | 4363.988632 | 139443.751806 | 3522.886321 | 1509.450419 | 826.808487 |
| min | 7489.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 12413.000000 | 933.000000 | 2010.000000 | 1720.000000 | 274.000000 |
| 50% | 15750.000000 | 2219.000000 | 4504.000000 | 3254.000000 | 757.000000 |
| 75% | 19682.000000 | 5707.000000 | 7479.500000 | 4299.000000 | 1570.000000 |
| max | 25229.000000 | 365243.000000 | 24672.000000 | 7197.000000 | 4292.000000 |
Note: For better readability, we can derive a new AGE column from DAYS_BIRTH and then drop DAYS_BIRTH. Similarly, the other DAYS columns can be converted to years.
# application_df['AGE'] = application_df['DAYS_BIRTH'] // 365
# application_df['YEAR_IN_SERVICE'] = application_df['DAYS_EMPLOYED'] // 365
# application_df['BANK_MEMBERSHIP_DURATION'] = application_df['DAYS_REGISTRATION'] // 365
application_df['CODE_GENDER'].value_counts()
CODE_GENDER F 202448 M 105059 XNA 4 Name: count, dtype: int64
# Get rid of the improper value XNA by replacing it with NaN (not the mode, as that would be imputation)
application_df['CODE_GENDER'] = application_df['CODE_GENDER'].replace('XNA',np.nan)
# Verify again
application_df['CODE_GENDER'].value_counts()
CODE_GENDER F 202448 M 105059 Name: count, dtype: int64
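XNA is not unique to CODE_GENDER; the sample rows above show it in ORGANIZATION_TYPE as well. A hedged sketch, on a toy frame standing in for application_df, of clearing the placeholder across all object columns at once:

```python
import numpy as np
import pandas as pd

# Toy frame with the 'XNA' placeholder seen in CODE_GENDER and
# ORGANIZATION_TYPE (values here are illustrative).
df = pd.DataFrame({
    'CODE_GENDER': ['F', 'M', 'XNA', 'F'],
    'ORGANIZATION_TYPE': ['School', 'XNA', 'Government', 'Other'],
})

# Replace the placeholder with NaN in every object column, so later
# missing-value summaries count it correctly.
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].replace('XNA', np.nan)

print(df.isna().sum())
```

On the real dataframe the same two lines would sweep every text column in one pass instead of handling each column individually.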
As we know, our data may contain exceptionally low or high values, termed outliers. It is important to identify such data points and treat them so they do not lead to wrong interpretations. We will consider the columns below for outlier analysis.
# Defining a function to plot outliers
def outlier_plot(var,title,label):
    plt.figure(figsize = [8,5])
    plt.title(title, fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Brown'})
    sns.boxplot(y = var)
    plt.ylabel(label, fontdict={'fontsize': 10, 'fontweight' : 5, 'color' : 'Grey'})
    plt.show()
# Plotting boxplot on AMT_INCOME_TOTAL for outlier analysis
var = application_df['AMT_INCOME_TOTAL']/100000
title = "Client's income"
label = 'Income in Lakhs'
outlier_plot(var,title,label)
# Describe to check the summary
(application_df['AMT_INCOME_TOTAL']/100000).describe()
count 307511.000000 mean 1.687979 std 2.371231 min 0.256500 25% 1.125000 50% 1.471500 75% 2.025000 max 1170.000000 Name: AMT_INCOME_TOTAL, dtype: float64
# print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of AMT_INCOME_TOTAL
(application_df['AMT_INCOME_TOTAL']/100000).quantile([0.5, 0.7, 0.9, 0.95, 0.99])
0.50 1.4715 0.70 1.8000 0.90 2.7000 0.95 3.3750 0.99 4.7250 Name: AMT_INCOME_TOTAL, dtype: float64
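The gap between the 99th percentile (~4.7 lakhs) and the maximum (1170 lakhs) suggests the upper tail could be capped before visual analysis. A minimal sketch on a toy series (the values are illustrative, mimicking that gap):

```python
import pandas as pd

# Toy income series in lakhs with one extreme value, mimicking the
# AMT_INCOME_TOTAL max of 1170 lakhs against a 99th percentile near 4.7.
income = pd.Series([1.1, 1.5, 2.0, 2.7, 3.3, 4.7, 1170.0])

cap = income.quantile(0.99)            # capping threshold
income_capped = income.clip(upper=cap)  # winsorize the upper tail only

print(cap, income_capped.max())
```

Whether to cap, drop, or keep such rows is a judgment call; capping preserves the row while stopping a single applicant from dominating the boxplot scale.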
# Plotting boxplot on AMT_CREDIT for outlier analysis
var = application_df['AMT_CREDIT']/100000
title = "Credit amount of the loan"
label = "Amount in Lakhs"
outlier_plot(var,title,label)
# Describe to check the summary
(application_df['AMT_CREDIT']/100000).describe()
count 307511.000000 mean 5.990260 std 4.024908 min 0.450000 25% 2.700000 50% 5.135310 75% 8.086500 max 40.500000 Name: AMT_CREDIT, dtype: float64
# print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of AMT_CREDIT
(application_df['AMT_CREDIT']/100000).quantile([0.5, 0.7, 0.9, 0.95, 0.99])
0.50 5.13531 0.70 7.55190 0.90 11.33748 0.95 13.50000 0.99 18.54000 Name: AMT_CREDIT, dtype: float64
# Plotting boxplot on AMT_ANNUITY for outlier analysis
var = application_df['AMT_ANNUITY']/100000
title = "Loan annuity"
label = "Loan Annuity in Lakhs"
outlier_plot(var,title,label)
# Describe to check the summary
(application_df['AMT_ANNUITY']/100000).describe()
count 307499.000000 mean 0.271086 std 0.144937 min 0.016155 25% 0.165240 50% 0.249030 75% 0.345960 max 2.580255 Name: AMT_ANNUITY, dtype: float64
# Plotting boxplot on AMT_GOODS_PRICE for outlier analysis
var = application_df['AMT_GOODS_PRICE']/100000
title = "Goods Price"
label = "Amount in Lakhs"
outlier_plot(var,title,label)
# Describe to check the summary
(application_df['AMT_GOODS_PRICE']/100000).describe()
count 307233.000000 mean 5.383962 std 3.694465 min 0.405000 25% 2.385000 50% 4.500000 75% 6.795000 max 40.500000 Name: AMT_GOODS_PRICE, dtype: float64
# Describe to check the summary
(application_df['AMT_GOODS_PRICE']/100000).quantile([0.5, 0.7, 0.9, 0.95, 0.99])
0.50 4.500 0.70 6.750 0.90 10.935 0.95 13.050 0.99 18.000 Name: AMT_GOODS_PRICE, dtype: float64
# Plotting boxplot on DAYS_BIRTH for outlier analysis
var = application_df['DAYS_BIRTH']//365
title = "Client's age"
label = "Age in years"
outlier_plot(var,title,label)
# Plotting boxplot on DAYS_EMPLOYED for outlier analysis
var = application_df['DAYS_EMPLOYED']//365
title = "Employment duration"
label = "Years in Service"
outlier_plot(var,title,label)
# Describe to check the summary
(application_df['DAYS_EMPLOYED']/365).describe()
count 307511.000000 mean 185.547239 std 382.037676 min 0.000000 25% 2.556164 50% 6.079452 75% 15.635616 max 1000.665753 Name: DAYS_EMPLOYED, dtype: float64
# print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of DAYS_EMPLOYED
(application_df['DAYS_EMPLOYED']/365).quantile([0.5, 0.7, 0.9, 0.95, 0.99])
0.50 6.079452 0.70 12.150685 0.90 1000.665753 0.95 1000.665753 0.99 1000.665753 Name: DAYS_EMPLOYED, dtype: float64
(application_df['DAYS_EMPLOYED']/365).quantile([0.5, 0.7, 0.8,0.85, 0.9])
0.50 6.079452 0.70 12.150685 0.80 25.172603 0.85 1000.665753 0.90 1000.665753 Name: DAYS_EMPLOYED, dtype: float64
# Plotting boxplot on DAYS_REGISTRATION for outlier analysis
var = application_df['DAYS_REGISTRATION']//365
title = "Bank membership duration"
label = "Registered for in years"
outlier_plot(var,title,label)
# Describe to check the summary
(application_df['DAYS_REGISTRATION']/365).describe()
count 307511.000000 mean 13.660604 std 9.651743 min 0.000000 25% 5.506849 50% 12.339726 75% 20.491781 max 67.594521 Name: DAYS_REGISTRATION, dtype: float64
# Check the Age Summary - Since DAYS_BIRTH is in days, we have divided it by 365 to get it in years
(application_df['DAYS_BIRTH'] // 365).describe()
count 307511.000000 mean 43.435968 std 11.954593 min 20.000000 25% 34.000000 50% 43.000000 75% 53.000000 max 69.000000 Name: DAYS_BIRTH, dtype: float64
# Binning DAYS_BIRTH based on above summary
bins = [0,20,30,40,50,60,100]
labels = ['Below 20','20-30','30-40','40-50','50-60','Above 60']
application_df['AGE_GROUP'] = pd.cut(application_df['DAYS_BIRTH'] // 365, bins = bins, labels = labels )
# Checking the values
application_df['AGE_GROUP'].value_counts().plot(kind='bar')
plt.title("No. of Loan Applicants Vs Age Group\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel('No. of applicants', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xlabel('Age Group', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xticks(rotation=30)
plt.show()
Note: We can make the AMT_INCOME_TOTAL data more readable by changing the unit to lakhs.
# Check the Total income summary - we can divide it by 100,000 for better readability
(application_df['AMT_INCOME_TOTAL']/100000).describe()
count 307511.000000 mean 1.687979 std 2.371231 min 0.256500 25% 1.125000 50% 1.471500 75% 2.025000 max 1170.000000 Name: AMT_INCOME_TOTAL, dtype: float64
# Binning AMT_INCOME_TOTAL based on above summary
bins = [0,1,2,5,10,20,50,1000]
labels = ['Upto 1L','1-2L','2-5L','5-10L','10-20L','20-50L','50L above']
application_df['INCOME_GROUP'] = pd.cut(application_df['AMT_INCOME_TOTAL'] / 100000, bins = bins, labels = labels )
# Checking the values
application_df['INCOME_GROUP'].value_counts().plot(kind='bar')
plt.title("Number of Applications Vs Income Group\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel('No. of applicants', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xlabel('Income Group', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xticks(rotation=30)
plt.show()
# Check the credit amount of the loan - we can divide it by 100,000 for better readability
(application_df['AMT_CREDIT']/100000).describe()
count 307511.000000 mean 5.990260 std 4.024908 min 0.450000 25% 2.700000 50% 5.135310 75% 8.086500 max 40.500000 Name: AMT_CREDIT, dtype: float64
# Binning AMT_CREDIT based on above summary
bins = [0,1,5,10,20,30,40,50,100]
labels = ['Upto 1L','1-5L','5-10L','10-20L','20-30L','30-40L','40-50L','50L above']
application_df['CREDIT_GROUP'] = pd.cut(application_df['AMT_CREDIT'] / 100000, bins = bins, labels = labels )
# Checking the values
application_df['CREDIT_GROUP'].value_counts().plot(kind='bar')
plt.title("Number of Applications Vs Credit Group\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel('No. of applicants', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xlabel('Credit Group', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xticks(rotation=30)
plt.show()
What is Imbalance Percentage?
Our dataset has a target column named 'TARGET' that indicates whether the client is a defaulter. If we segregate the dataset on this column and the distribution turns out to be 50-50, i.e. 50% of the applicants are defaulters and the other 50% are not, the dataset is BALANCED. Any other split makes it IMBALANCED.
# Checking imbalance percentage
application_df['TARGET'].value_counts(normalize = True)*100
TARGET 0 91.927118 1 8.072882 Name: proportion, dtype: float64
# Plotting imbalance percentage
# Extracting the imbalance percentage
Repayment_Status = application_df['TARGET'].value_counts(normalize=True)*100
# Defining the x values
x= ['Others','Defaulters']
# Defining the y ticks
axes= plt.axes()
axes.set_ylim([0,100])
axes.set_yticks([10,20,30,40,50,60,70,80,90,100])
sns.barplot(x=x, y=Repayment_Status)
# Adding plot title, and x & y labels
plt.title('Imbalance Percentage\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel("Borrower Category")
plt.ylabel("Percentage")
# Displaying the plot
plt.show()
Now, let's create two datasets by segregating the original data on the TARGET column: defaulters in one dataframe and the rest in another.
# Creating data frame of Others
application_df0 = application_df[application_df['TARGET']==0]
application_df0.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | YEARS_BEGINEXPLUATATION_AVG | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_MODE | FLOORSMAX_MODE | YEARS_BEGINEXPLUATATION_MEDI | FLOORSMAX_MEDI | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | AGE_GROUP | INCOME_GROUP | CREDIT_GROUP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | 1188 | 1186 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | NaN | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.0714 | No | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40-50 | 2-5L | 10-20L |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | 19046 | 225 | 4260 | 2531 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50-60 | Upto 1L | 1-5L |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | 19005 | 3039 | 9833 | 2437 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | 617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 50-60 | 1-2L | 1-5L |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | 19932 | 3038 | 4311 | 3458 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50-60 | 1-2L | 5-10L |
| 5 | 100008 | 0 | Cash loans | M | N | Y | 0 | 99000.0 | 490495.5 | 27517.5 | 454500.0 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | 16941 | 1588 | 4970 | 477 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Other | 0.354225 | 0.621226 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 2536.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 40-50 | Upto 1L | 1-5L |
# Creating data frame of Defaulters
application_df1 = application_df[application_df['TARGET']==1]
application_df1.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | YEARS_BEGINEXPLUATATION_AVG | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_MODE | FLOORSMAX_MODE | YEARS_BEGINEXPLUATATION_MEDI | FLOORSMAX_MEDI | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | AGE_GROUP | INCOME_GROUP | CREDIT_GROUP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | 9461 | 637 | 3648 | 2120 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 0.9722 | 0.0833 | 0.9722 | 0.0833 | 0.9722 | 0.0833 | 0.0149 | No | 2.0 | 2.0 | 2.0 | 2.0 | 1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 20-30 | 2-5L | 1-5L |
| 26 | 100031 | 1 | Cash loans | F | N | Y | 0 | 112500.0 | 979992.0 | 27076.5 | 702000.0 | Unaccompanied | Working | Secondary / secondary special | Widow | House / apartment | 0.018029 | 18724 | 2628 | 6573 | 1827 | 1 | 1 | 0 | 1 | 0 | 0 | Cooking staff | 1.0 | 3 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.548477 | 0.190706 | 0.9732 | 0.0417 | 0.9732 | 0.0417 | 0.9732 | 0.0417 | 0.0085 | Yes | 10.0 | 1.0 | 10.0 | 0.0 | 161.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 50-60 | 1-2L | 5-10L |
| 40 | 100047 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 1193580.0 | 35028.0 | 855000.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.025164 | 17482 | 1262 | 1182 | 1029 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | TUESDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.306841 | 0.320163 | 0.9960 | 0.2500 | 0.9960 | 0.2500 | 0.9960 | 0.2500 | 0.1463 | No | 0.0 | 0.0 | 0.0 | 0.0 | 1075.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 | 40-50 | 2-5L | 10-20L |
| 42 | 100049 | 1 | Cash loans | F | N | N | 0 | 135000.0 | 288873.0 | 16258.5 | 238500.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.007305 | 13384 | 3597 | 45 | 4409 | 1 | 1 | 1 | 1 | 1 | 0 | Sales staff | 2.0 | 3 | 3 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.674203 | 0.399676 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 1480.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 30-40 | 1-2L | 1-5L |
| 81 | 100096 | 1 | Cash loans | F | N | Y | 0 | 81000.0 | 252000.0 | 14593.5 | 252000.0 | Unaccompanied | Pensioner | Secondary / secondary special | Married | House / apartment | 0.028663 | 24794 | 365243 | 5391 | 4199 | 1 | 0 | 0 | 1 | 0 | 0 | NaN | 2.0 | 2 | 2 | THURSDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.023952 | 0.720944 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Above 60 | Upto 1L | 1-5L |
We will plot graphs of the categorical variables below to draw inferences.
# Defining a function to plot a categorical variable for defaulters vs others
def univariate_categorical_plot(category1, category2, xlabel):
    plt.figure(figsize = [15,7])
    plt.subplot(1,2,1)
    sns.countplot(x=category1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.xticks(rotation=45, ha='right')
    plt.subplot(1,2,2)
    sns.countplot(x=category2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.xticks(rotation=45, ha='right')
    plt.show()
# Defining a function to plot defaulter percentage against univariate categorical variable
def perc_defaulter(col1, col2, title, xlabel):
    tempdf = application_df[[col1,col2]].groupby([col2], as_index=False).mean()
    tempdf[col1] = tempdf[col1]*100
    tempdf.sort_values(by=col1, ascending=False, inplace=True)
    sns.barplot(x=col2, y = col1, data = tempdf)
    plt.title(title, fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.ylabel('Defaulter %')
    plt.xticks(rotation=45, ha='right')
    plt.show()
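Since TARGET is 1 for defaulters and 0 otherwise, the group-wise mean that perc_defaulter computes is exactly the default rate per category. A toy illustration of that identity (data here is made up for the example):

```python
import pandas as pd

# Toy data: TARGET is 1 for defaulters, so the group-wise mean of
# TARGET is exactly the default rate within each category.
df = pd.DataFrame({
    'TARGET': [1, 0, 0, 0, 1, 1],
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Cash loans', 'Cash loans',
                           'Revolving loans', 'Revolving loans', 'Revolving loans'],
})

rates = df.groupby('NAME_CONTRACT_TYPE')['TARGET'].mean() * 100
print(rates.sort_values(ascending=False))
```

Cash loans default in 1 of 3 toy rows and revolving loans in 2 of 3, so the group means come out to one third and two thirds respectively, which is what the bar heights in the plots below represent.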
# Analyzing w.r.t Contract Type column
category1 = application_df1['NAME_CONTRACT_TYPE']
category2 = application_df0['NAME_CONTRACT_TYPE']
xlabel = 'Contract Type'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'NAME_CONTRACT_TYPE'
title = 'Contract Type Vs Default Percentage\n'
xlabel = 'Contract Type'
perc_defaulter(col1, col2, title, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'CODE_GENDER'
title = 'Gender Vs Default percentage\n'
xlabel = 'Gender'
perc_defaulter(col1, col2, title, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'OCCUPATION_TYPE'
title = 'Occupation Vs Default percentage\n'
xlabel = 'Occupation'
perc_defaulter(col1, col2, title, xlabel)
# Analyzing w.r.t Income Type column
category1 = application_df1['NAME_INCOME_TYPE']
category2 = application_df0['NAME_INCOME_TYPE']
xlabel = 'Income Source'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'NAME_INCOME_TYPE'
title = 'Income type Vs Default percentage\n'
xlabel = 'Income type'
perc_defaulter(col1, col2, title, xlabel)
# Analyzing w.r.t Education Type column
category1 = application_df1['NAME_EDUCATION_TYPE']
category2 = application_df0['NAME_EDUCATION_TYPE']
xlabel = 'Education'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'NAME_EDUCATION_TYPE'
title = 'Education Vs Default percentage\n'
xlabel = 'Education'
perc_defaulter(col1, col2, title, xlabel)
# Analyzing w.r.t Family Status column
category1 = application_df1['NAME_FAMILY_STATUS']
category2 = application_df0['NAME_FAMILY_STATUS']
xlabel = 'Family Status'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'NAME_FAMILY_STATUS'
title = 'Family Status Vs Default percentage\n'
xlabel = 'Family Status'
perc_defaulter(col1, col2, title, xlabel)
# Analyzing w.r.t Housing Type column
category1 = application_df1['NAME_HOUSING_TYPE']
category2 = application_df0['NAME_HOUSING_TYPE']
xlabel = 'Housing Type'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'NAME_HOUSING_TYPE'
title = 'House Type Vs Default percentage\n'
xlabel = 'House Type'
perc_defaulter(col1, col2, title, xlabel)
# Analyzing w.r.t Income Group column
category1 = application_df1['INCOME_GROUP']
category2 = application_df0['INCOME_GROUP']
xlabel = 'Income Group'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'INCOME_GROUP'
title = 'Income Group Vs Default percentage\n'
xlabel = 'Income Group'
perc_defaulter(col1, col2, title, xlabel)
# Analyzing w.r.t Age Group column
category1 = application_df1['AGE_GROUP']
category2 = application_df0['AGE_GROUP']
xlabel = 'Age Group'
univariate_categorical_plot(category1, category2, xlabel)
# Plot the percentage of defaulters in each category
col1 = 'TARGET'
col2 = 'AGE_GROUP'
title = 'Age Group Vs Default percentage\n'
xlabel = 'Age Group'
perc_defaulter(col1, col2, title, xlabel)
Here we are going to consider the numeric columns below and draw our conclusions from them.
# Defining a function to plot univariate numerical columns
def univariate_numerical_plots(col1, col2, title, xlabel):
    # sns.distplot is deprecated; kdeplot draws the same smoothed density
    sns.kdeplot(col1, label='Defaulters')
    sns.kdeplot(col2, label='Others')
    plt.title(title, fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.legend()
    plt.show()
# Plotting AMT_INCOME_TOTAL
col1 = application_df1['AMT_INCOME_TOTAL']/100000
col2 = application_df0['AMT_INCOME_TOTAL']/100000
title = 'Total Income of the client\n'
xlabel = 'Total income in lakhs'
univariate_numerical_plots(col1, col2, title, xlabel)
# Plotting AMT_CREDIT
col1 = application_df1['AMT_CREDIT']/100000
col2 = application_df0['AMT_CREDIT']/100000
title = 'Credit amount\n'
xlabel = 'Credit amount in lakhs'
univariate_numerical_plots(col1, col2, title, xlabel)
# Plotting AMT_ANNUITY
col1 = application_df1['AMT_ANNUITY']/100000
col2 = application_df0['AMT_ANNUITY']/100000
title = 'Annuity\n'
xlabel = 'Annuity in lakhs'
univariate_numerical_plots(col1, col2, title, xlabel)
# Plotting AMT_GOODS_PRICE
col1 = application_df1['AMT_GOODS_PRICE']/100000
col2 = application_df0['AMT_GOODS_PRICE']/100000
title = 'Goods Price\n'
xlabel = 'Goods price in lakhs'
univariate_numerical_plots(col1, col2, title, xlabel)
# Plotting CNT_CHILDREN (kdeplot replaces the deprecated distplot)
sns.kdeplot(application_df1['CNT_CHILDREN'], label='Defaulters')
sns.kdeplot(application_df0['CNT_CHILDREN'], label='Others')
ticks = [0,1,2,3,4,5,6,7,8,9,10,11]
plt.xticks(ticks)
plt.legend()
plt.title('Count of children\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Children')
plt.show()
# Plotting DAYS_BIRTH
col1 = application_df1['DAYS_BIRTH']//365
col2 = application_df0['DAYS_BIRTH']//365
title = 'Age\n'
xlabel = 'Age in years'
univariate_numerical_plots(col1, col2, title, xlabel)
corr_df1 = application_df1[['AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE','DAYS_BIRTH','DAYS_EMPLOYED','CNT_CHILDREN']].corr()
corr_df1
| | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | DAYS_BIRTH | DAYS_EMPLOYED | CNT_CHILDREN |
|---|---|---|---|---|---|---|---|
| AMT_INCOME_TOTAL | 1.000000 | 0.038131 | 0.046421 | 0.037583 | -0.003096 | -0.014977 | 0.004796 |
| AMT_CREDIT | 0.038131 | 1.000000 | 0.752195 | 0.983103 | 0.135316 | 0.001930 | -0.001675 |
| AMT_ANNUITY | 0.046421 | 0.752195 | 1.000000 | 0.752699 | 0.014303 | -0.081207 | 0.031257 |
| AMT_GOODS_PRICE | 0.037583 | 0.983103 | 0.752699 | 1.000000 | 0.135810 | 0.006642 | -0.008112 |
| DAYS_BIRTH | -0.003096 | 0.135316 | 0.014303 | 0.135810 | 1.000000 | 0.582185 | -0.259109 |
| DAYS_EMPLOYED | -0.014977 | 0.001930 | -0.081207 | 0.006642 | 0.582185 | 1.000000 | -0.192864 |
| CNT_CHILDREN | 0.004796 | -0.001675 | 0.031257 | -0.008112 | -0.259109 | -0.192864 | 1.000000 |
corr_df0 = application_df0[['AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE','DAYS_BIRTH','DAYS_EMPLOYED','CNT_CHILDREN']].corr()
corr_df0
| | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | DAYS_BIRTH | DAYS_EMPLOYED | CNT_CHILDREN |
|---|---|---|---|---|---|---|---|
| AMT_INCOME_TOTAL | 1.000000 | 0.342799 | 0.418953 | 0.349462 | -0.062609 | -0.140392 | 0.027397 |
| AMT_CREDIT | 0.342799 | 1.000000 | 0.771309 | 0.987250 | 0.047378 | -0.070104 | 0.003081 |
| AMT_ANNUITY | 0.418953 | 0.771309 | 1.000000 | 0.776686 | -0.012263 | -0.104978 | 0.020905 |
| AMT_GOODS_PRICE | 0.349462 | 0.987250 | 0.776686 | 1.000000 | 0.044565 | -0.068609 | -0.000525 |
| DAYS_BIRTH | -0.062609 | 0.047378 | -0.012263 | 0.044565 | 1.000000 | 0.626114 | -0.336966 |
| DAYS_EMPLOYED | -0.140392 | -0.070104 | -0.104978 | -0.068609 | 0.626114 | 1.000000 | -0.245174 |
| CNT_CHILDREN | 0.027397 | 0.003081 | 0.020905 | -0.000525 | -0.336966 | -0.245174 | 1.000000 |
# Plot correlation heatmap for numerical variables
plt.figure(figsize=[20,10])
plt.subplot(1,2,1)
sns.heatmap(corr_df1, cmap="YlGnBu", annot = True)
plt.title('Correlation - Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xticks(rotation=45)
plt.subplot(1,2,2)
sns.heatmap(corr_df0, cmap="YlGnBu", annot = True)
plt.title('Correlation - Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xticks(rotation=45)
plt.show()
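The two heatmaps invite a side-by-side reading, but the strongest pairs can also be pulled out programmatically. Below is a minimal sketch of that ranking on a small synthetic frame standing in for the numeric columns of `application_df1` (the data here is illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns of application_df1
rng = np.random.default_rng(0)
df = pd.DataFrame({'AMT_CREDIT': rng.normal(5, 1, 200)})
df['AMT_GOODS_PRICE'] = df['AMT_CREDIT'] * 0.98 + rng.normal(0, 0.1, 200)
df['AMT_INCOME_TOTAL'] = rng.normal(3, 1, 200)

corr = df.corr()
# Keep only the upper triangle (each pair once, no diagonal), then rank by |r|
pairs = (corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
             .stack()
             .dropna()
             .abs()
             .sort_values(ascending=False))
print(pairs.head(3))
```

On the real data, `AMT_CREDIT` and `AMT_GOODS_PRICE` top this ranking in both groups (0.98 for defaulters, 0.99 for others), matching the heatmaps.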
We will perform three types of bivariate analysis to better understand the data and draw some important insights.
Columns considered -
# Defining function for categorical - categorical variable plotting
def cat_cat_plot(var1, var2, label, legend):
    plt.figure(figsize=[15, 7])
    plt.subplot(1, 2, 1)
    sns.countplot(x=var1, hue=var2, data=application_df1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
    plt.xlabel(label)
    plt.legend(title=legend)
    plt.xticks(rotation=45)
    plt.subplot(1, 2, 2)
    sns.countplot(x=var1, hue=var2, data=application_df0)
    plt.title('Non-Defaulters\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
    plt.xlabel(label)
    plt.legend(title=legend)
    plt.xticks(rotation=45)
    plt.show()
# NAME_CONTRACT_TYPE - CODE_GENDER
var1 = 'NAME_CONTRACT_TYPE'
var2 = 'CODE_GENDER'
label = 'Contract Type'
legend = 'Gender'
cat_cat_plot(var1, var2, label, legend)
# NAME_INCOME_TYPE - NAME_CONTRACT_TYPE
var1 = 'NAME_INCOME_TYPE'
var2 = 'NAME_CONTRACT_TYPE'
label = 'Income Type'
legend = 'Contract type'
cat_cat_plot(var1, var2, label, legend)
# INCOME_GROUP - CODE_GENDER
var1 = 'INCOME_GROUP'
var2 = 'CODE_GENDER'
label = 'Income Group'
legend = 'Gender'
cat_cat_plot(var1, var2, label, legend)
# CODE_GENDER - FLAG_OWN_REALTY
var1 = 'CODE_GENDER'
var2 = 'FLAG_OWN_REALTY'
label = 'Gender'
legend = 'Own house?'
cat_cat_plot(var1, var2, label, legend)
# NAME_HOUSING_TYPE - FLAG_OWN_REALTY
var1 = 'NAME_HOUSING_TYPE'
var2 = 'FLAG_OWN_REALTY'
label = 'Housing Type'
legend = 'Own house?'
cat_cat_plot(var1, var2, label, legend)
# NAME_HOUSING_TYPE - NAME_FAMILY_STATUS
var1 = 'NAME_HOUSING_TYPE'
var2 = 'NAME_FAMILY_STATUS'
label = 'Housing Type'
legend = 'Family Status'
cat_cat_plot(var1, var2, label, legend)
Columns considered -
# Defining function for categorical - continuous variable plotting
def cat_cont_plot(var1, var2, xlabel, ylabel):
    plt.figure(figsize=(20, 5))
    plt.subplot(1, 2, 1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
    sns.boxplot(x=var1, y=var2, data=application_df1)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.subplot(1, 2, 2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
    sns.boxplot(x=var1, y=var2, data=application_df0)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.show()
# NAME_CONTRACT_TYPE - AMT_CREDIT
var1 = 'NAME_CONTRACT_TYPE'
var2 = 'AMT_CREDIT'
xlabel = 'Contract Type'
ylabel = 'Credit Amount'
cat_cont_plot(var1, var2, xlabel, ylabel)
# NAME_INCOME_TYPE - AMT_CREDIT
var1 = 'NAME_INCOME_TYPE'
var2 = 'AMT_CREDIT'
xlabel = 'Income Type'
ylabel = 'Credit Amount'
cat_cont_plot(var1, var2, xlabel, ylabel)
# NAME_EDUCATION_TYPE - AMT_ANNUITY
var1 = 'NAME_EDUCATION_TYPE'
var2 = 'AMT_ANNUITY'
xlabel = 'Education Type'
ylabel = 'Annuity'
cat_cont_plot(var1, var2, xlabel, ylabel)
# NAME_HOUSING_TYPE - AMT_CREDIT
var1 = 'NAME_HOUSING_TYPE'
var2 = 'AMT_CREDIT'
xlabel = 'Housing Type'
ylabel = 'Credit Amount'
cat_cont_plot(var1, var2, xlabel, ylabel)
# OCCUPATION_TYPE - AMT_CREDIT
var1 = 'OCCUPATION_TYPE'
var2 = 'AMT_CREDIT'
xlabel = 'Occupation Type'
ylabel = 'Credit Amount'
cat_cont_plot(var1, var2, xlabel, ylabel)
Here we consider the continuous-valued columns below for plotting the graphs -
# Defining function for continuous - continuous variable plotting
def cont_cont_plot(col1, col2, xlabel, ylabel):
    plt.figure(figsize=[20, 5])
    plt.subplot(1, 2, 1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
    sns.scatterplot(x=col1, y=col2, data=application_df1)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.subplot(1, 2, 2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
    sns.scatterplot(x=col1, y=col2, data=application_df0)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.show()
# AMT_CREDIT-AMT_ANNUITY
col1 = 'AMT_CREDIT'
col2 = 'AMT_ANNUITY'
xlabel = 'Credit Amount'
ylabel = 'Annuity'
cont_cont_plot(col1, col2, xlabel, ylabel)
# AMT_CREDIT-AMT_GOODS_PRICE
col1 = 'AMT_CREDIT'
col2 = 'AMT_GOODS_PRICE'
xlabel = 'Credit Amount'
ylabel = 'Goods Price'
cont_cont_plot(col1, col2, xlabel, ylabel)
# AMT_CREDIT-AMT_INCOME_TOTAL
col1 = 'AMT_CREDIT'
col2 = 'AMT_INCOME_TOTAL'
xlabel = 'Credit Amount'
ylabel = 'Total income'
cont_cont_plot(col1, col2, xlabel, ylabel)
We also have the previous application histories of the applicants. Let's explore them and see whether we can find any trends.
# Read the previous data file
previous_df = pd.read_csv('previous_application.csv')
previous_df.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
# Check the number of rows and columns in the dataframe
previous_df.shape
(1670214, 37)
# Check the column-wise info of the dataframe
previous_df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB
# Check the summary for the numeric columns
previous_df.describe()
| | SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1.670214e+06 | 1.670214e+06 | 774370.000000 | 5951.000000 | 5951.000000 | 1.670214e+06 | 1.670214e+06 | 1.297984e+06 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| mean | 1.923089e+06 | 2.783572e+05 | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | 1.248418e+01 | 9.964675e-01 | 0.079637 | 0.188357 | 0.773503 | -8.806797e+02 | 3.139511e+02 | 1.605408e+01 | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | 3.334028e+00 | 5.932963e-02 | 0.107823 | 0.087671 | 0.100879 | 7.790997e+02 | 7.127443e+03 | 1.456729e+01 | 88916.115834 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -0.000015 | 0.034781 | 0.373150 | -2.922000e+03 | -1.000000e+00 | 0.000000e+00 | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | 1.000000e+01 | 1.000000e+00 | 0.000000 | 0.160716 | 0.715645 | -1.300000e+03 | -1.000000e+00 | 6.000000e+00 | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | 1.200000e+01 | 1.000000e+00 | 0.051605 | 0.189122 | 0.835095 | -5.810000e+02 | 3.000000e+00 | 1.200000e+01 | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | 1.500000e+01 | 1.000000e+00 | 0.108909 | 0.193330 | 0.852537 | -2.800000e+02 | 8.200000e+01 | 2.400000e+01 | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | 2.300000e+01 | 1.000000e+00 | 1.000000 | 1.000000 | 1.000000 | -1.000000e+00 | 4.000000e+06 | 8.400000e+01 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
# Check for missing values in percentage
round(100 * previous_df.isnull().mean(),2)
SK_ID_PREV 0.00 SK_ID_CURR 0.00 NAME_CONTRACT_TYPE 0.00 AMT_ANNUITY 22.29 AMT_APPLICATION 0.00 AMT_CREDIT 0.00 AMT_DOWN_PAYMENT 53.64 AMT_GOODS_PRICE 23.08 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 FLAG_LAST_APPL_PER_CONTRACT 0.00 NFLAG_LAST_APPL_IN_DAY 0.00 RATE_DOWN_PAYMENT 53.64 RATE_INTEREST_PRIMARY 99.64 RATE_INTEREST_PRIVILEGED 99.64 NAME_CASH_LOAN_PURPOSE 0.00 NAME_CONTRACT_STATUS 0.00 DAYS_DECISION 0.00 NAME_PAYMENT_TYPE 0.00 CODE_REJECT_REASON 0.00 NAME_TYPE_SUITE 49.12 NAME_CLIENT_TYPE 0.00 NAME_GOODS_CATEGORY 0.00 NAME_PORTFOLIO 0.00 NAME_PRODUCT_TYPE 0.00 CHANNEL_TYPE 0.00 SELLERPLACE_AREA 0.00 NAME_SELLER_INDUSTRY 0.00 CNT_PAYMENT 22.29 NAME_YIELD_GROUP 0.00 PRODUCT_COMBINATION 0.02 DAYS_FIRST_DRAWING 40.30 DAYS_FIRST_DUE 40.30 DAYS_LAST_DUE_1ST_VERSION 40.30 DAYS_LAST_DUE 40.30 DAYS_TERMINATION 40.30 NFLAG_INSURED_ON_APPROVAL 40.30 dtype: float64
# Extract the column names with more than 50% data missing and their respective missing value percentage
missing_percentage = round(100 * previous_df.isnull().sum() / len(previous_df.index), 2)
missing50 = [(col, percent) for col, percent in missing_percentage.items() if percent > 50]
# Extract the column names from the above list
cols_to_drop = [col for col, _ in missing50]
cols_to_drop
['AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED']
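The same >50% rule can also be applied in a single expression with a boolean mask over the per-column missing fraction. A small sketch on a toy frame (equivalent to the notebook's two-step version above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'keep': [1, 2, 3, 4],
    'mostly_missing': [np.nan, np.nan, np.nan, 4],  # 75% missing
})

# True for every column where more than half the values are NaN
to_drop = toy.columns[toy.isnull().mean() > 0.5]
toy = toy.drop(columns=to_drop)
print(list(toy.columns))  # -> ['keep']
```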
# Remove the columns with more than 50% missing values
previous_df.drop(cols_to_drop, axis = 1, inplace = True)
# Check the shape
previous_df.shape
(1670214, 33)
# Check the missing values for remaining
round(100 * previous_df.isnull().mean(),2)
SK_ID_PREV 0.00 SK_ID_CURR 0.00 NAME_CONTRACT_TYPE 0.00 AMT_ANNUITY 22.29 AMT_APPLICATION 0.00 AMT_CREDIT 0.00 AMT_GOODS_PRICE 23.08 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 FLAG_LAST_APPL_PER_CONTRACT 0.00 NFLAG_LAST_APPL_IN_DAY 0.00 NAME_CASH_LOAN_PURPOSE 0.00 NAME_CONTRACT_STATUS 0.00 DAYS_DECISION 0.00 NAME_PAYMENT_TYPE 0.00 CODE_REJECT_REASON 0.00 NAME_TYPE_SUITE 49.12 NAME_CLIENT_TYPE 0.00 NAME_GOODS_CATEGORY 0.00 NAME_PORTFOLIO 0.00 NAME_PRODUCT_TYPE 0.00 CHANNEL_TYPE 0.00 SELLERPLACE_AREA 0.00 NAME_SELLER_INDUSTRY 0.00 CNT_PAYMENT 22.29 NAME_YIELD_GROUP 0.00 PRODUCT_COMBINATION 0.02 DAYS_FIRST_DRAWING 40.30 DAYS_FIRST_DUE 40.30 DAYS_LAST_DUE_1ST_VERSION 40.30 DAYS_LAST_DUE 40.30 DAYS_TERMINATION 40.30 NFLAG_INSURED_ON_APPROVAL 40.30 dtype: float64
# Check the summary for NFLAG_INSURED_ON_APPROVAL
previous_df['NFLAG_INSURED_ON_APPROVAL'].describe()
count 997149.000000 mean 0.332570 std 0.471134 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 1.000000 Name: NFLAG_INSURED_ON_APPROVAL, dtype: float64
Suggestions-
previous_df[['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION','DAYS_DECISION']].describe()
| | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | DAYS_DECISION |
|---|---|---|---|---|---|---|
| count | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 1.670214e+06 |
| mean | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | -8.806797e+02 |
| std | 88916.115834 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 7.790997e+02 |
| min | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | -2.922000e+03 |
| 25% | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | -1.300000e+03 |
| 50% | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | -5.810000e+02 |
| 75% | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | -2.800000e+02 |
| max | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | -1.000000e+00 |
# Create a column list for DAYS columns
col_list = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION','DAYS_DECISION']
# Replace the values with their respective absolute values
for col in col_list:
    previous_df[col] = previous_df[col].abs()
# Verify the changes
previous_df[['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION','DAYS_DECISION']].describe()
| | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | DAYS_DECISION |
|---|---|---|---|---|---|---|
| count | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 1.670214e+06 |
| mean | 342340.056543 | 15949.224065 | 35163.363265 | 78152.730207 | 83505.775017 | 8.806797e+02 |
| std | 88413.495220 | 72007.270877 | 106405.950190 | 148833.342466 | 152484.418802 | 7.790997e+02 |
| min | 2.000000 | 2.000000 | 0.000000 | 2.000000 | 2.000000 | 1.000000e+00 |
| 25% | 365243.000000 | 475.000000 | 257.000000 | 455.000000 | 447.000000 | 2.800000e+02 |
| 50% | 365243.000000 | 921.000000 | 741.000000 | 1155.000000 | 1171.000000 | 5.810000e+02 |
| 75% | 365243.000000 | 1825.000000 | 1735.000000 | 2418.000000 | 2501.000000 | 1.300000e+03 |
| max | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 2.922000e+03 |
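A maximum of 365243 days (roughly a thousand years) in every `DAYS_*` column strongly suggests a sentinel meaning "no such event" rather than a real date offset. Treating it as missing is an assumption, not a step this notebook performs, but on a toy series it would look like this:

```python
import numpy as np
import pandas as pd

days = pd.Series([565.0, 365243.0, 831.0, 365243.0], name='DAYS_FIRST_DRAWING')

# Treat the suspected sentinel as missing so it no longer skews the summary
cleaned = days.replace(365243.0, np.nan)
print(cleaned.mean())  # mean of 565 and 831 only -> 698.0
```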
# Box plot AMT_ANNUITY
var = previous_df['AMT_ANNUITY']/100000
title = 'Annuity Amount\n'
label = 'Amount in lakhs'
outlier_plot(var,title,label)
# Check Summary
previous_df['AMT_ANNUITY'].describe()
count 1.297979e+06 mean 1.595512e+04 std 1.478214e+04 min 0.000000e+00 25% 6.321780e+03 50% 1.125000e+04 75% 2.065842e+04 max 4.180581e+05 Name: AMT_ANNUITY, dtype: float64
# Check the quantiles
previous_df['AMT_ANNUITY'].quantile([0.5,0.7,0.90,0.95,0.99])
0.50 11250.0000 0.70 17783.1270 0.90 34703.8200 0.95 45336.7800 0.99 69685.7886 Name: AMT_ANNUITY, dtype: float64
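These quantiles show how heavy the right tail is: the 99th percentile sits far below the maximum. One common (optional) treatment is to cap values at a chosen quantile with `clip`; a sketch on a toy series, not something applied to `previous_df` here:

```python
import pandas as pd

annuity = pd.Series([6321.78, 11250.0, 20658.42, 45336.78, 418058.1])

# Cap at the 99th percentile of this toy series to damp extreme outliers
cap = annuity.quantile(0.99)
capped = annuity.clip(upper=cap)
print(capped.max() <= cap)  # -> True
```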
# Box plot AMT_GOODS_PRICE
var = previous_df['AMT_GOODS_PRICE']/100000
title = 'Goods Price\n'
label = 'Amount in lakhs'
outlier_plot(var,title,label)
# Check summary
previous_df['AMT_GOODS_PRICE'].describe()
count 1.284699e+06 mean 2.278473e+05 std 3.153966e+05 min 0.000000e+00 25% 5.084100e+04 50% 1.123200e+05 75% 2.340000e+05 max 6.905160e+06 Name: AMT_GOODS_PRICE, dtype: float64
# Check the quantiles
previous_df['AMT_GOODS_PRICE'].quantile([0.5,0.7,0.90,0.95,0.99])
0.50 112320.0 0.70 211500.0 0.90 585000.0 0.95 900000.0 0.99 1395000.0 Name: AMT_GOODS_PRICE, dtype: float64
# Box plot CNT_PAYMENT
var = previous_df['CNT_PAYMENT']
title = 'Term of previous credit\n'
label = 'Term'
outlier_plot(var,title,label)
# Check summary
previous_df['CNT_PAYMENT'].describe()
count 1.297984e+06 mean 1.605408e+01 std 1.456729e+01 min 0.000000e+00 25% 6.000000e+00 50% 1.200000e+01 75% 2.400000e+01 max 8.400000e+01 Name: CNT_PAYMENT, dtype: float64
Next, merge the current application data frame with the previous application data frame on `SK_ID_CURR`.
# Merge both application_df and previous_df
finaldf = pd.merge(application_df, previous_df, on='SK_ID_CURR', how = 'inner')
# verify
finaldf.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | NAME_TYPE_SUITE_x | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START_x | HOUR_APPR_PROCESS_START_x | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | YEARS_BEGINEXPLUATATION_AVG | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_MODE | FLOORSMAX_MODE | YEARS_BEGINEXPLUATATION_MEDI | FLOORSMAX_MEDI | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | AGE_GROUP | INCOME_GROUP | CREDIT_GROUP | SK_ID_PREV | NAME_CONTRACT_TYPE_y | AMT_ANNUITY_y | AMT_APPLICATION | AMT_CREDIT_y | AMT_GOODS_PRICE_y | WEEKDAY_APPR_PROCESS_START_y | HOUR_APPR_PROCESS_START_y | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | 
DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE_y | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | 9461 | 637 | 3648 | 2120 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 0.9722 | 0.0833 | 0.9722 | 0.0833 | 0.9722 | 0.0833 | 0.0149 | No | 2.0 | 2.0 | 2.0 | 2.0 | 1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 20-30 | 2-5L | 1-5L | 1038818 | Consumer loans | 9251.775 | 179055.0 | 179055.0 | 179055.0 | SATURDAY | 9 | Y | 1 | XAP | Approved | 606 | XNA | XAP | NaN | New | Vehicles | POS | XNA | Stone | 500 | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | 565.0 | 125.0 | 25.0 | 17.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | 1188 | 1186 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | NaN | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.0714 | No | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40-50 | 2-5L | 10-20L | 1810518 | Cash loans | 98356.995 | 900000.0 | 1035882.0 | 900000.0 | FRIDAY | 12 | Y | 1 | XNA | Approved | 746 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | 716.0 | 386.0 | 536.0 | 527.0 | 1.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | 1188 | 1186 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | NaN | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.0714 | No | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40-50 | 2-5L | 10-20L | 2636178 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 337500.0 | SUNDAY | 17 | Y | 1 | XAP | Approved | 828 | Cash through the bank | XAP | Family | Refreshed | Furniture | POS | XNA | Stone | 1400 | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | 797.0 | 647.0 | 647.0 | 639.0 | 0.0 |
| 3 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | 1188 | 1186 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | NaN | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.9851 | 0.2917 | 0.0714 | No | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40-50 | 2-5L | 10-20L | 2396755 | Consumer loans | 6737.310 | 68809.5 | 68053.5 | 68809.5 | SATURDAY | 15 | Y | 1 | XAP | Approved | 2341 | Cash through the bank | XAP | Family | Refreshed | Consumer Electronics | POS | XNA | Country-wide | 200 | Consumer electronics | 12.0 | middle | POS household with interest | 365243.0 | 2310.0 | 1980.0 | 1980.0 | 1976.0 | 1.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | 19046 | 225 | 4260 | 2531 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50-60 | Upto 1L | 1-5L | 1564014 | Consumer loans | 5357.250 | 24282.0 | 20106.0 | 24282.0 | FRIDAY | 5 | Y | 1 | XAP | Approved | 815 | Cash through the bank | XAP | Unaccompanied | New | Mobile | POS | XNA | Regional / Local | 30 | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | 784.0 | 694.0 | 724.0 | 714.0 | 0.0 |
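Because one current application can match several previous ones, an inner merge on `SK_ID_CURR` multiplies rows — which is why `finaldf` grows to about 1.4M rows (note `SK_ID_CURR` 100003 appearing three times above). pandas can make this expectation explicit via the `validate` argument; a sketch on toy frames (names illustrative):

```python
import pandas as pd

apps = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0, 1]})
prev = pd.DataFrame({'SK_ID_CURR': [1, 1, 2], 'AMT_CREDIT': [10, 20, 30]})

# validate='one_to_many' raises if SK_ID_CURR were duplicated in `apps`
merged = pd.merge(apps, prev, on='SK_ID_CURR', how='inner',
                  validate='one_to_many')
print(len(merged))  # -> 3: client 1 contributes two rows
```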
# Check the column info
finaldf.info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1413701 entries, 0 to 1413700 Data columns (total 116 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 TARGET int64 2 NAME_CONTRACT_TYPE_x object 3 CODE_GENDER object 4 FLAG_OWN_CAR object 5 FLAG_OWN_REALTY object 6 CNT_CHILDREN int64 7 AMT_INCOME_TOTAL float64 8 AMT_CREDIT_x float64 9 AMT_ANNUITY_x float64 10 AMT_GOODS_PRICE_x float64 11 NAME_TYPE_SUITE_x object 12 NAME_INCOME_TYPE object 13 NAME_EDUCATION_TYPE object 14 NAME_FAMILY_STATUS object 15 NAME_HOUSING_TYPE object 16 REGION_POPULATION_RELATIVE float64 17 DAYS_BIRTH int64 18 DAYS_EMPLOYED int64 19 DAYS_REGISTRATION int64 20 DAYS_ID_PUBLISH int64 21 FLAG_MOBIL int64 22 FLAG_EMP_PHONE int64 23 FLAG_WORK_PHONE int64 24 FLAG_CONT_MOBILE int64 25 FLAG_PHONE int64 26 FLAG_EMAIL int64 27 OCCUPATION_TYPE object 28 CNT_FAM_MEMBERS float64 29 REGION_RATING_CLIENT int64 30 REGION_RATING_CLIENT_W_CITY int64 31 WEEKDAY_APPR_PROCESS_START_x object 32 HOUR_APPR_PROCESS_START_x int64 33 REG_REGION_NOT_LIVE_REGION int64 34 REG_REGION_NOT_WORK_REGION int64 35 LIVE_REGION_NOT_WORK_REGION int64 36 REG_CITY_NOT_LIVE_CITY int64 37 REG_CITY_NOT_WORK_CITY int64 38 LIVE_CITY_NOT_WORK_CITY int64 39 ORGANIZATION_TYPE object 40 EXT_SOURCE_2 float64 41 EXT_SOURCE_3 float64 42 YEARS_BEGINEXPLUATATION_AVG float64 43 FLOORSMAX_AVG float64 44 YEARS_BEGINEXPLUATATION_MODE float64 45 FLOORSMAX_MODE float64 46 YEARS_BEGINEXPLUATATION_MEDI float64 47 FLOORSMAX_MEDI float64 48 TOTALAREA_MODE float64 49 EMERGENCYSTATE_MODE object 50 OBS_30_CNT_SOCIAL_CIRCLE float64 51 DEF_30_CNT_SOCIAL_CIRCLE float64 52 OBS_60_CNT_SOCIAL_CIRCLE float64 53 DEF_60_CNT_SOCIAL_CIRCLE float64 54 DAYS_LAST_PHONE_CHANGE float64 55 FLAG_DOCUMENT_2 int64 56 FLAG_DOCUMENT_3 int64 57 FLAG_DOCUMENT_4 int64 58 FLAG_DOCUMENT_5 int64 59 FLAG_DOCUMENT_6 int64 60 FLAG_DOCUMENT_7 int64 61 FLAG_DOCUMENT_8 int64 62 FLAG_DOCUMENT_9 int64 63 FLAG_DOCUMENT_10 int64 64 FLAG_DOCUMENT_11 int64 65 FLAG_DOCUMENT_12 int64 66 
FLAG_DOCUMENT_13 int64 67 FLAG_DOCUMENT_14 int64 68 FLAG_DOCUMENT_15 int64 69 FLAG_DOCUMENT_16 int64 70 FLAG_DOCUMENT_17 int64 71 FLAG_DOCUMENT_18 int64 72 FLAG_DOCUMENT_19 int64 73 FLAG_DOCUMENT_20 int64 74 FLAG_DOCUMENT_21 int64 75 AMT_REQ_CREDIT_BUREAU_HOUR float64 76 AMT_REQ_CREDIT_BUREAU_DAY float64 77 AMT_REQ_CREDIT_BUREAU_WEEK float64 78 AMT_REQ_CREDIT_BUREAU_MON float64 79 AMT_REQ_CREDIT_BUREAU_QRT float64 80 AMT_REQ_CREDIT_BUREAU_YEAR float64 81 AGE_GROUP category 82 INCOME_GROUP category 83 CREDIT_GROUP category 84 SK_ID_PREV int64 85 NAME_CONTRACT_TYPE_y object 86 AMT_ANNUITY_y float64 87 AMT_APPLICATION float64 88 AMT_CREDIT_y float64 89 AMT_GOODS_PRICE_y float64 90 WEEKDAY_APPR_PROCESS_START_y object 91 HOUR_APPR_PROCESS_START_y int64 92 FLAG_LAST_APPL_PER_CONTRACT object 93 NFLAG_LAST_APPL_IN_DAY int64 94 NAME_CASH_LOAN_PURPOSE object 95 NAME_CONTRACT_STATUS object 96 DAYS_DECISION int64 97 NAME_PAYMENT_TYPE object 98 CODE_REJECT_REASON object 99 NAME_TYPE_SUITE_y object 100 NAME_CLIENT_TYPE object 101 NAME_GOODS_CATEGORY object 102 NAME_PORTFOLIO object 103 NAME_PRODUCT_TYPE object 104 CHANNEL_TYPE object 105 SELLERPLACE_AREA int64 106 NAME_SELLER_INDUSTRY object 107 CNT_PAYMENT float64 108 NAME_YIELD_GROUP object 109 PRODUCT_COMBINATION object 110 DAYS_FIRST_DRAWING float64 111 DAYS_FIRST_DUE float64 112 DAYS_LAST_DUE_1ST_VERSION float64 113 DAYS_LAST_DUE float64 114 DAYS_TERMINATION float64 115 NFLAG_INSURED_ON_APPROVAL float64 dtypes: category(3), float64(37), int64(47), object(29) memory usage: 1.2+ GB
# Rename the duplicated columns
finaldf = finaldf.rename({'NAME_CONTRACT_TYPE_y':'NAME_CONTRACT_TYPE_PREV',
'AMT_ANNUITY_y':'AMT_ANNUITY_PREV',
'AMT_CREDIT_y':'AMT_CREDIT_PREV',
'AMT_GOODS_PRICE_y':'AMT_GOODS_PRICE_PREV',
'NAME_TYPE_SUITE_y':'NAME_TYPE_SUITE_PREV',
'NAME_TYPE_SUITE_x':'NAME_TYPE_SUITE_CURR',
'AMT_GOODS_PRICE_x':'AMT_GOODS_PRICE_CURR',
'AMT_ANNUITY_x':'AMT_ANNUITY_CURR',
'AMT_CREDIT_x':'AMT_CREDIT_CURR',
'NAME_CONTRACT_TYPE_x':'NAME_CONTRACT_TYPE_CURR'}, axis=1)
#Verify
finaldf.info(verbose=True)
(Output: the same 1,413,701 × 116 structure, confirming the `_x`/`_y` suffixes on the contract-type, annuity, credit, goods-price, and suite columns are now `_CURR`/`_PREV`.)
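Renaming after the fact works; as an alternative, the `_x`/`_y` collisions could have been avoided at merge time with pandas' `suffixes` parameter. A minimal sketch on toy frames (the column and frame names here are illustrative, not the original merge):

```python
import pandas as pd

# Toy frames standing in for the current and previous application data
curr = pd.DataFrame({'SK_ID_CURR': [1, 2], 'AMT_CREDIT': [100.0, 200.0]})
prev = pd.DataFrame({'SK_ID_CURR': [1, 2], 'AMT_CREDIT': [90.0, 150.0]})

# suffixes= labels overlapping columns up front, so no rename step is needed
merged = curr.merge(prev, on='SK_ID_CURR', suffixes=('_CURR', '_PREV'))
print(merged.columns.tolist())
# ['SK_ID_CURR', 'AMT_CREDIT_CURR', 'AMT_CREDIT_PREV']
```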
# Remove unwanted columns
finaldf.drop(['REGION_POPULATION_RELATIVE','REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY',
'LIVE_CITY_NOT_WORK_CITY', 'YEARS_BEGINEXPLUATATION_AVG','YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BEGINEXPLUATATION_MODE','FLOORSMAX_AVG','FLOORSMAX_MODE','FLOORSMAX_MEDI','EMERGENCYSTATE_MODE',
'OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE',
'DAYS_LAST_PHONE_CHANGE','AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR',
'DAYS_FIRST_DRAWING','DAYS_FIRST_DUE','DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE','DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL', 'FLAG_DOCUMENT_2','FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4','FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10','FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13','FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16','FLAG_DOCUMENT_17','FLAG_DOCUMENT_18','FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20','FLAG_DOCUMENT_21','FLAG_MOBIL','FLAG_EMP_PHONE', 'FLAG_WORK_PHONE','FLAG_CONT_MOBILE',
'FLAG_PHONE','FLAG_EMAIL', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'TOTALAREA_MODE','FLAG_LAST_APPL_PER_CONTRACT',
'NFLAG_LAST_APPL_IN_DAY', 'SELLERPLACE_AREA','WEEKDAY_APPR_PROCESS_START_x',
'WEEKDAY_APPR_PROCESS_START_y','HOUR_APPR_PROCESS_START_x','HOUR_APPR_PROCESS_START_y'],axis=1,inplace=True)
#Verify
finaldf.info(verbose=True)
(Output: `finaldf` is now 1,413,701 rows × 47 columns; dtypes: category(3), float64(10), int64(9), object(25); memory usage ≈ 478.6 MB. Missing values remain mainly in `OCCUPATION_TYPE`, `NAME_TYPE_SUITE_PREV`, and the previous-application annuity, goods-price, and payment-count columns.)
Categorical Analysis
#NAME_CONTRACT_STATUS
tempdf = finaldf[['TARGET','NAME_CONTRACT_STATUS']].groupby(['NAME_CONTRACT_STATUS'], as_index=False).sum()
tempdf.sort_values(by='TARGET', ascending=False, inplace=True)
sns.barplot(x='NAME_CONTRACT_STATUS', y = 'TARGET', data = tempdf)
plt.title('Previous Contract Status\n',fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'} )
plt.xlabel('Loan status')
plt.ylabel('Number of defaulters')
plt.xticks(rotation=45, ha='right')
plt.show()
# NAME_CONTRACT_TYPE_PREV
tempdf = finaldf[['TARGET','NAME_CONTRACT_TYPE_PREV']].groupby(['NAME_CONTRACT_TYPE_PREV'], as_index=False).sum()
tempdf.sort_values(by='TARGET', ascending=False, inplace=True)
sns.barplot(x='NAME_CONTRACT_TYPE_PREV', y = 'TARGET', data = tempdf)
plt.title('Previous Contract Type\n',fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Contract type')
plt.ylabel('Number of defaulters')
plt.xticks(rotation=45, ha='right')
plt.show()
# NAME_CASH_LOAN_PURPOSE
tempdf = finaldf[finaldf['NAME_CASH_LOAN_PURPOSE'] != 'XAP']
tempdf = tempdf[tempdf['NAME_CASH_LOAN_PURPOSE'] != 'XNA']
tempdf = tempdf[['TARGET','NAME_CASH_LOAN_PURPOSE']].groupby(['NAME_CASH_LOAN_PURPOSE'], as_index=False).sum()
plt.figure(figsize=[20,10])
sns.barplot(x='NAME_CASH_LOAN_PURPOSE', y = 'TARGET', data = tempdf)
plt.title('Cash Loan Purpose\n',fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Purpose')
plt.ylabel('Number of defaulters')
plt.xticks(rotation=90, ha='right')
plt.show()
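One caveat with these plots: summing `TARGET` counts defaulters per category, so categories with more applications naturally dominate. Taking the mean of `TARGET` instead gives the default rate, which is often easier to compare across categories. A small sketch on toy data (the frame is illustrative, not the actual dataset):

```python
import pandas as pd

# Toy data: TARGET is 1 for defaulters, 0 for repayers
toy = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved', 'Approved', 'Refused', 'Refused'],
    'TARGET': [0, 1, 1, 1],
})

# The mean of a 0/1 column per group is the fraction of defaulters
rates = toy.groupby('NAME_CONTRACT_STATUS')['TARGET'].mean()
print(rates.to_dict())  # {'Approved': 0.5, 'Refused': 1.0}
```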
Numerical Analysis
# Numerical data analysis
sns.kdeplot(finaldf['AMT_CREDIT_PREV'])  # distplot is deprecated; kdeplot draws the same density curve
plt.title('Credit Amount of Previous Applications\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Credit Amount')
plt.show()
# Numerical data analysis
sns.kdeplot(finaldf['AMT_ANNUITY_PREV'])  # distplot is deprecated; kdeplot draws the same density curve
plt.title('Annuity of Previous Applications\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Annuity')
plt.show()
# Check the correlation
corrdf = finaldf[['AMT_ANNUITY_PREV','AMT_APPLICATION','AMT_CREDIT_PREV','AMT_GOODS_PRICE_PREV']].corr()
corrdf
| | AMT_ANNUITY_PREV | AMT_APPLICATION | AMT_CREDIT_PREV | AMT_GOODS_PRICE_PREV |
|---|---|---|---|---|
| AMT_ANNUITY_PREV | 1.000000 | 0.809972 | 0.817977 | 0.822204 |
| AMT_APPLICATION | 0.809972 | 1.000000 | 0.975683 | 0.999871 |
| AMT_CREDIT_PREV | 0.817977 | 0.975683 | 1.000000 | 0.993201 |
| AMT_GOODS_PRICE_PREV | 0.822204 | 0.999871 | 0.993201 | 1.000000 |
# Plot correlation heatmap for numerical variables
plt.figure(figsize=[10,5])
sns.heatmap(corrdf, cmap="YlGnBu", annot = True)
plt.title('Correlation - Previous Applications\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.show()
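Since `AMT_APPLICATION`, `AMT_CREDIT_PREV`, and `AMT_GOODS_PRICE_PREV` correlate above 0.97, keeping all three adds little independent signal. A common pruning sketch (the 0.95 cutoff is an arbitrary choice) scans the upper triangle of the absolute correlation matrix:

```python
import numpy as np
import pandas as pd

# Toy frame where 'b' is nearly identical to 'a'
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [1.1, 2.0, 3.1, 4.0],
                   'c': [4.0, 1.0, 3.0, 2.0]})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['b'] is flagged as redundant with 'a'
```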
# Filter out 'XAP' and 'XNA' from 'NAME_CASH_LOAN_PURPOSE'
tempdf = finaldf[(finaldf['NAME_CASH_LOAN_PURPOSE'] != 'XAP') & (finaldf['NAME_CASH_LOAN_PURPOSE'] != 'XNA')]
plt.figure(figsize=[20, 10])
plt.title('Cash Loan Purpose\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
sns.countplot(x='NAME_CASH_LOAN_PURPOSE', hue='TARGET', data=tempdf)
plt.xlabel('Cash loan purpose')
plt.xticks(rotation=90)
plt.show()
# NAME_CONTRACT_STATUS - CODE_REJECT_REASON
# Filter out 'XAP' and 'XNA' from 'CODE_REJECT_REASON'
tempdf = finaldf[(finaldf['CODE_REJECT_REASON'] != 'XAP') & (finaldf['CODE_REJECT_REASON'] != 'XNA')]
plt.figure(figsize=[20, 10])
plt.title('Rejection Reason\n', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Brown'})
sns.countplot(x='CODE_REJECT_REASON', hue='TARGET', data=tempdf)
plt.xlabel('Rejection reason')
plt.xticks(rotation=90)
plt.show()
# NAME_CASH_LOAN_PURPOSE - AMT_CREDIT_PREV
plt.figure(figsize=(20,10))
plt.title('Cash Loan Purpose Vs Credit Amount\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.barplot(x=finaldf['NAME_CASH_LOAN_PURPOSE'], y=finaldf['AMT_CREDIT_PREV'], hue=finaldf['TARGET'] )
plt.xlabel('Cash Loan Purpose')
plt.ylabel('Credit amount')
plt.xticks(rotation=90)
plt.legend(title='Is defaulter?', loc= 'upper right')
plt.show()
# NAME_CONTRACT_STATUS - AMT_INCOME_TOTAL
plt.figure(figsize=(10,8))
plt.title('Contract Status Vs Total Income\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.barplot(x=finaldf['NAME_CONTRACT_STATUS'], y=finaldf['AMT_INCOME_TOTAL'], hue=finaldf['TARGET'] )
plt.xlabel('Contract Status')
plt.ylabel('Total Income')
plt.legend(title='Is defaulter?', loc= 'upper right')
plt.show()
# AMT_CREDIT_PREV - AMT_APPLICATION
plt.figure(figsize=[20,5])
plt.title('Previous Credit Amount Vs Amount Applied\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.scatterplot(x = finaldf['AMT_CREDIT_PREV'], y = finaldf['AMT_APPLICATION'])
plt.xlabel('Credit Amount')
plt.ylabel('Amount Applied')
plt.show()
# AMT_CREDIT_PREV - AMT_GOODS_PRICE_PREV
plt.figure(figsize=[20,5])
plt.title('Previous Credit Amount Vs Goods Price\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.scatterplot(x = finaldf['AMT_CREDIT_PREV'], y = finaldf['AMT_GOODS_PRICE_PREV'])
plt.xlabel('Credit Amount')
plt.ylabel('Goods Price')
plt.show()
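A practical note: scatter plots over 1.4 million points are slow and heavily overplotted; a random sample usually shows the same relationship. A sketch (the sample size here is arbitrary):

```python
import pandas as pd

# Toy stand-in for finaldf; the real frame has ~1.4 million rows
df = pd.DataFrame({'AMT_CREDIT_PREV': range(10_000),
                   'AMT_APPLICATION': range(10_000)})

# A fixed random_state keeps the sampled plot reproducible
sample = df.sample(n=1_000, random_state=42)
# sns.scatterplot(x='AMT_CREDIT_PREV', y='AMT_APPLICATION', data=sample)
# would then draw 1,000 points instead of the full frame
print(len(sample))  # 1000
```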
From the above analysis, we can draw the following insights.
EDA Conclusions
Young male applicants with lower secondary education, in the lower income group, living with parents or in rented housing, and applying for low-range cash loans carry the highest risk and should be declined.
Female applicants are generally likely to repay, except those on maternity leave; the bank can reduce loan amounts for applicants in that group.
Applicants taking cash loans for repairs or urgent needs are more likely to default, so the bank can refuse these purposes.
Applicants with unused offers are more likely to default despite comparatively high total income; loans to this group can be priced at a higher interest rate.
The bank can target business owners, students, and working-class applicants with an academic degree or higher education, as these groups show little difficulty in repayment.
Loans taken for buying a home or garage can also be approved, as these show a lower chance of default.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import itertools
from sklearn.cluster import KMeans, MiniBatchKMeans, AgglomerativeClustering, DBSCAN
from kmodes import kmodes
from kmodes.kprototypes import KPrototypes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, learning_curve, train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
%matplotlib inline
applicationdf = pd.read_csv('application_data.csv')
previousdf = pd.read_csv('previous_application.csv')
# For numeric columns, fill missing values with the sentinel -1
numeric_value_appl = applicationdf.select_dtypes(include=['int', 'float']).fillna(-1)
# For categorical columns, fill missing values with 'None'
categorical_value_appl = applicationdf.select_dtypes(include=['object']).fillna('None')
application_cleaned = numeric_value_appl.join(categorical_value_appl)
print(application_cleaned.isnull().any().unique()) # [False]
# The numeric selection captures 106 columns, matching the count from the
# profiling report, and all of them now have no missing values.
print('The shape of cleaned dataset is {}'.format(application_cleaned.shape))
print('The shape of original dataset is {}'.format(applicationdf.shape))
[False] The shape of cleaned dataset is (307511, 122) The shape of original dataset is (307511, 122)
application_cleaned.loc[application_cleaned['DAYS_EMPLOYED'] == 365243, 'DAYS_EMPLOYED'] = 0
application_cleaned.loc[application_cleaned['DAYS_EMPLOYED'] == 1, 'DAYS_EMPLOYED'] = 0
print(application_cleaned['DAYS_EMPLOYED'].max()) # now the maximum value is 0
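The two sentinel replacements above can also be written as a single `replace` call; a sketch on a toy series (365243 is the dataset's placeholder value for pensioners and the unemployed):

```python
import pandas as pd

# Toy DAYS_EMPLOYED values: negatives are real, 365243 and 1 are placeholders
s = pd.Series([-1000, 365243, -2500, 1])
cleaned = s.replace({365243: 0, 1: 0})
print(cleaned.tolist())  # [-1000, 0, -2500, 0]
```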
application_cleaned.loc[application_cleaned['EMERGENCYSTATE_MODE'] == 'Yes','EMERGENCYSTATE_MODE'] = 1
application_cleaned.loc[application_cleaned['EMERGENCYSTATE_MODE'] == 'No', 'EMERGENCYSTATE_MODE'] = 0
print('Finish!')
print('The shape of cleaned dataset is {}'.format(application_cleaned.shape))
0 Finish! The shape of cleaned dataset is (307511, 122)
column_df_appl = pd.DataFrame(numeric_value_appl.columns.values, columns=['col_name'])
column_df_appl['Contains_day'] = column_df_appl['col_name'].str.contains('days', case=False)
col_for_modification_appl = column_df_appl[column_df_appl['Contains_day'] == True]['col_name'].values
print('Here are columns need modification:')
print(col_for_modification_appl)
for col in col_for_modification_appl:
application_cleaned.loc[:,col] = round(abs(application_cleaned.loc[:, col] // 365) , 2)
name = col.replace('DAYS', 'YEARS')
application_cleaned.rename(columns={col:name}, inplace=True)
print('The shape of cleaned dataset is {}'.format(application_cleaned.shape))
Here are columns need modification: ['DAYS_BIRTH' 'DAYS_EMPLOYED' 'DAYS_REGISTRATION' 'DAYS_ID_PUBLISH' 'DAYS_LAST_PHONE_CHANGE'] The shape of cleaned dataset is (307511, 122)
# make a copy of cleaned data for building
train_df = application_cleaned.copy()
print(train_df.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to EMERGENCYSTATE_MODE dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
# Apply K-modes clustering to group loan defaulters and summarize their characteristics
# Select the loan defaulter data
defaulter_df = application_cleaned[application_cleaned['TARGET'] == 1]
print(defaulter_df.shape)
# Select features relevant to user profiling
profiling_feature_list = ['CODE_GENDER','AMT_INCOME_TOTAL','AMT_CREDIT','NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE','YEARS_BIRTH','YEARS_EMPLOYED','OCCUPATION_TYPE',
'ORGANIZATION_TYPE', 'CNT_CHILDREN','NAME_TYPE_SUITE','NAME_FAMILY_STATUS',
'NAME_HOUSING_TYPE','CNT_FAM_MEMBERS','FLAG_OWN_CAR','FLAG_OWN_REALTY',
'OWN_CAR_AGE','REGION_RATING_CLIENT_W_CITY','NAME_CONTRACT_TYPE',
'AMT_ANNUITY','AMT_GOODS_PRICE']
defaulter_profiling_df = defaulter_df[profiling_feature_list]
print(defaulter_profiling_df.info())
print(defaulter_profiling_df.isnull().any())
(Output: `defaulter_df` has shape (24825, 122); the 21 profiling columns — 6 float64, 4 int64, 11 object — are all 24,825 non-null, and the isnull check returns False for every column.)
cat_col = defaulter_profiling_df.select_dtypes(include=['category']).columns.values
print(cat_col)
for column in cat_col:
defaulter_profiling_df = defaulter_profiling_df.astype({column:'object'})
# Double check the dataframe and ensure there is no NaN value in it.
print(defaulter_profiling_df.info())
defaulter_profiling_df.fillna('Unknown', inplace=True)
print(defaulter_profiling_df.head())
[]
<class 'pandas.core.frame.DataFrame'>
Index: 24825 entries, 0 to 307509
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CODE_GENDER 24825 non-null object
1 AMT_INCOME_TOTAL 24825 non-null float64
2 AMT_CREDIT 24825 non-null float64
3 NAME_INCOME_TYPE 24825 non-null object
4 NAME_EDUCATION_TYPE 24825 non-null object
5 YEARS_BIRTH 24825 non-null int64
6 YEARS_EMPLOYED 24825 non-null int64
7 OCCUPATION_TYPE 24825 non-null object
8 ORGANIZATION_TYPE 24825 non-null object
9 CNT_CHILDREN 24825 non-null int64
10 NAME_TYPE_SUITE 24825 non-null object
11 NAME_FAMILY_STATUS 24825 non-null object
12 NAME_HOUSING_TYPE 24825 non-null object
13 CNT_FAM_MEMBERS 24825 non-null float64
14 FLAG_OWN_CAR 24825 non-null object
15 FLAG_OWN_REALTY 24825 non-null object
16 OWN_CAR_AGE 24825 non-null float64
17 REGION_RATING_CLIENT_W_CITY 24825 non-null int64
18 NAME_CONTRACT_TYPE 24825 non-null object
19 AMT_ANNUITY 24825 non-null float64
20 AMT_GOODS_PRICE 24825 non-null float64
dtypes: float64(6), int64(4), object(11)
memory usage: 4.2+ MB
None
CODE_GENDER AMT_INCOME_TOTAL AMT_CREDIT NAME_INCOME_TYPE \
0 M 202500.0 406597.5 Working
26 F 112500.0 979992.0 Working
40 M 202500.0 1193580.0 Commercial associate
42 F 135000.0 288873.0 Working
81 F 81000.0 252000.0 Pensioner
NAME_EDUCATION_TYPE YEARS_BIRTH YEARS_EMPLOYED \
0 Secondary / secondary special 26 2
26 Secondary / secondary special 52 8
40 Secondary / secondary special 48 4
42 Secondary / secondary special 37 10
81 Secondary / secondary special 68 0
OCCUPATION_TYPE ORGANIZATION_TYPE CNT_CHILDREN NAME_TYPE_SUITE \
0 Laborers Business Entity Type 3 0 Unaccompanied
26 Cooking staff Business Entity Type 3 0 Unaccompanied
40 Laborers Business Entity Type 3 0 Unaccompanied
42 Sales staff Self-employed 0 Unaccompanied
81 None XNA 0 Unaccompanied
NAME_FAMILY_STATUS NAME_HOUSING_TYPE CNT_FAM_MEMBERS FLAG_OWN_CAR \
0 Single / not married House / apartment 1.0 N
26 Widow House / apartment 1.0 N
40 Married House / apartment 2.0 N
42 Civil marriage House / apartment 2.0 N
81 Married House / apartment 2.0 N
FLAG_OWN_REALTY OWN_CAR_AGE REGION_RATING_CLIENT_W_CITY \
0 Y -1.0 2
26 Y -1.0 2
40 Y -1.0 2
42 N -1.0 3
81 Y -1.0 2
NAME_CONTRACT_TYPE AMT_ANNUITY AMT_GOODS_PRICE
0 Cash loans 24700.5 351000.0
26 Cash loans 27076.5 702000.0
40 Cash loans 35028.0 855000.0
42 Cash loans 16258.5 238500.0
81 Cash loans 14593.5 252000.0
temp_df = defaulter_profiling_df.copy()
for col_name in temp_df.select_dtypes('object').columns.values:
value_list = temp_df[col_name].unique()
for index, value in enumerate(value_list):
temp_df[col_name].replace(value, index, inplace=True)
print(temp_df.head())
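The nested replace loop above is quadratic in the number of category values; `pd.factorize` produces the same appearance-order integer codes in one call. A sketch (note that such codes impose an arbitrary ordering, which the PCA step treats as a numeric distance):

```python
import pandas as pd

s = pd.Series(['Working', 'Pensioner', 'Working', 'Student'])
# Codes are assigned in order of first appearance, like the loop above
codes, uniques = pd.factorize(s)
print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['Working', 'Pensioner', 'Student']
```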
pca = PCA()
pca.fit(temp_df)
pca_samples = pca.transform(temp_df)
variance_ratio = pca.explained_variance_ratio_
print('explained_variance_ratio_:{}'.format(pca.explained_variance_ratio_))
CODE_GENDER AMT_INCOME_TOTAL AMT_CREDIT NAME_INCOME_TYPE \
0 0 202500.0 406597.5 0
26 1 112500.0 979992.0 0
40 0 202500.0 1193580.0 1
42 1 135000.0 288873.0 0
81 1 81000.0 252000.0 2
NAME_EDUCATION_TYPE YEARS_BIRTH YEARS_EMPLOYED OCCUPATION_TYPE \
0 0 26 2 0
26 0 52 8 1
40 0 48 4 0
42 0 37 10 2
81 0 68 0 3
ORGANIZATION_TYPE CNT_CHILDREN NAME_TYPE_SUITE NAME_FAMILY_STATUS \
0 0 0 0 0
26 0 0 0 1
40 0 0 0 2
42 1 0 0 3
81 2 0 0 2
NAME_HOUSING_TYPE CNT_FAM_MEMBERS FLAG_OWN_CAR FLAG_OWN_REALTY \
0 0 1.0 0 0
26 0 1.0 0 0
40 0 2.0 0 0
42 0 2.0 0 1
81 0 2.0 0 0
OWN_CAR_AGE REGION_RATING_CLIENT_W_CITY NAME_CONTRACT_TYPE AMT_ANNUITY \
0 -1.0 2 0 24700.5
26 -1.0 2 0 27076.5
40 -1.0 2 0 35028.0
42 -1.0 3 0 16258.5
81 -1.0 2 0 14593.5
AMT_GOODS_PRICE
0 351000.0
26 702000.0
40 855000.0
42 238500.0
81 252000.0
explained_variance_ratio_:[7.20106232e-01 2.77431266e-01 2.37682929e-03 8.56728973e-05
1.97649457e-10 1.50398523e-10 1.18913894e-10 3.13468424e-11
1.82335272e-11 2.02741449e-12 1.11136049e-12 9.23583315e-13
8.74863141e-13 7.81614235e-13 7.39419943e-13 3.20726340e-13
2.99917174e-13 2.55763113e-13 1.23648192e-13 8.11976709e-14
6.75379014e-14]
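The ~99.7% of variance captured by the first two components is largely a scale effect: `AMT_CREDIT` and the other monetary columns are in the hundreds of thousands while the encoded categoricals are single digits, so unscaled PCA is dominated by the money columns. Standardizing first spreads the variance; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One monetary-scale column next to two unit-scale ones,
# mimicking AMT_CREDIT beside integer-coded categoricals
X = np.column_stack([
    rng.normal(0, 100_000, 200),
    rng.normal(0, 1, 200),
    rng.normal(0, 1, 200),
])

# Unscaled: the large column absorbs nearly all the variance
raw_ratio = PCA().fit(X).explained_variance_ratio_

# Scaled: variance is spread across the components
pipe = Pipeline([('scale', StandardScaler()), ('pca', PCA())])
pipe.fit(X)
scaled_ratio = pipe.named_steps['pca'].explained_variance_ratio_

print(raw_ratio[0], scaled_ratio[0])
```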
# Visualize the explained_variance_ratio
fig, ax = plt.subplots(figsize=(14, 5))
sns.set(font_scale=1)
plt.step(range(temp_df.shape[1]), pca.explained_variance_ratio_.cumsum(), where='mid',
label='cumulative explained variance')
sns.barplot(x=np.arange(1, temp_df.shape[1]+1), y=pca.explained_variance_ratio_, alpha=0.5, color='g', label='individual explained variance')
plt.xlim(0, 20)
ax.set_xticklabels([s if int(s.get_text())%2 == 0 else '' for s in ax.get_xticklabels()])
plt.ylabel('Explained variance')
plt.xlabel('Principal components')
plt.legend(loc='upper right', fontsize=10)
plt.show()
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
# Plot learning curve function
def plot_learning_curve(estimator, title, X, y, cv, train_sizes):
plt.figure()
plt.title(title)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
return plt
# Define the RandomForestClassifier (X_train/Y_train come from an earlier train_test_split)
rfc_best = RandomForestClassifier(n_estimators=120, max_depth=3, random_state=1)
# Plot the learning curve
rfc_learning_curve = plot_learning_curve(rfc_best, "Random forest classification learning curves",
X_train, Y_train, cv=5,
train_sizes=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
rfc_learning_curve.show()
As the diagram shows, the training and cross-validation scores converge toward the same limit as the number of training samples increases, and the overall score is good. This indicates the model neither overfits nor underfits.
(If the model underfit, both the training and cross-validation scores would be low; if it overfit, the training score would stay well above the cross-validation score, leaving an obvious gap even at the maximum number of training examples.)
# Fit on the training set and evaluate prediction performance on the test set
rfc_best.fit(X_train, Y_train)
Y_predict = rfc_best.predict_proba(X_test)[:,1]
print(Y_predict)
print(rfc_best.classes_)
print(roc_auc_score(Y_test, Y_predict))
[0.06415442 0.09483616 0.08829253 ... 0.08948133 0.06426616 0.07304601] [0 1] 0.7093644114303688
Guidelines for interpreting roc_auc_score:
1. roc_auc_score = 0.5: the model has no diagnostic value — it is effectively guessing at random.
2. 0.5 < roc_auc_score < 0.7: the model captures some tendency toward default, but with poor precision.
3. 0.7 < roc_auc_score < 0.9: the model predicts default tendency with reasonably good precision.
4. 0.9 < roc_auc_score < 1: the model identifies defaulters with high precision.
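As a quick sanity check of this scale, `roc_auc_score` on a tiny hand-made example (scores are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
# Perfect ranking: every positive outscores every negative
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])
# Fully reversed ranking
reversed_ = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])
print(perfect, reversed_)  # 1.0 0.0
```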
# Print out the ROC curve points for reference
fpr, tpr, thresholds = roc_curve(Y_test, Y_predict)
result = pd.DataFrame([thresholds,tpr,fpr], index= ['threshold','TPR','FPR'])
print(result)
(Output: a 3-row DataFrame — `threshold`, `TPR`, `FPR` — with one column per ROC operating point, over 11,000 in total. Thresholds descend from ≈1.14 toward ≈0.05 while TPR and FPR climb from 0 toward 1; the full printout is truncated here.)
11235 11236 11237 11238 11239 11240 \
threshold 0.050729 0.050548 0.050548 0.050385 0.050385 0.050278
TPR 0.998083 0.998083 0.998243 0.998243 0.998403 0.998403
FPR 0.988516 0.989521 0.989521 0.990272 0.990272 0.990810
11241 11242 11243 11244 11245 11246 \
threshold 0.050277 0.050024 0.050022 0.049769 0.049768 0.049446
TPR 0.998562 0.998562 0.998722 0.998722 0.998882 0.998882
FPR 0.990810 0.991928 0.991928 0.992934 0.992934 0.994010
11247 11248 11249 11250 11251 11252 \
threshold 0.049445 0.049227 0.049220 0.049044 0.049042 0.048570
TPR 0.999042 0.999042 0.999201 0.999201 0.999361 0.999361
FPR 0.994010 0.994676 0.994676 0.995171 0.995171 0.996191
11253 11254 11255 11256 11257 11258 \
threshold 0.048567 0.048354 0.048332 0.048078 0.048060 0.047737
TPR 0.999521 0.999521 0.999681 0.999681 0.999840 0.999840
FPR 0.996191 0.996616 0.996616 0.996984 0.996984 0.997451
11259 11260
threshold 0.047731 0.040998
TPR 1.000000 1.000000
FPR 0.997451 1.000000
[3 rows x 11261 columns]
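The long printout above is simply the output of `sklearn.metrics.roc_curve` arranged as a DataFrame. A minimal sketch of how such a table is built (using synthetic labels and scores here, since the notebook's `rfc_best` predictions are not reproduced in this snippet):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Synthetic stand-ins for the true labels and predicted default probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(rng.normal(0.3 + 0.3 * y_true, 0.2), 0, 1)

# One (FPR, TPR) operating point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_df = pd.DataFrame({'threshold': thresholds, 'TPR': tpr, 'FPR': fpr}).T
print(roc_df.shape)
```

The curve always starts at (FPR, TPR) = (0, 0) for the highest threshold and ends at (1, 1) for the lowest, which is what the first and last columns of the printed table show.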
# plot the ROC curve
plt.figure()
plt.scatter(fpr, tpr)  # individual operating points
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve')
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
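The curve can be summarized by a single number, the area under it. A hedged sketch (again on synthetic data; in the notebook the actual `fpr`/`tpr` arrays from the tuned model would be used):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
# class-1 scores are centered higher than class-0 scores
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.15, size=500), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)
area = auc(fpr, tpr)  # trapezoidal area under the (FPR, TPR) curve
print(round(area, 3))
```

`auc(fpr, tpr)` agrees with `roc_auc_score(y_true, y_score)` computed directly from the labels and probabilities; a value near 0.5 means the model is no better than chance, while values near 1.0 indicate strong separation of defaulters from non-defaulters.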
# print out the feature importance
feature_importance_df = pd.DataFrame(data=rfc_best.feature_importances_, index=rfc_best.feature_names_in_)
feature_importance_df.sort_values(0, ascending=False, inplace=True)
print(feature_importance_df)
                                              0
EXT_SOURCE_2                           0.181941
EXT_SOURCE_3                           0.135399
EXT_SOURCE_1                           0.073067
YEARS_BIRTH                            0.058044
YEARS_EMPLOYED                         0.037148
...                                         ...
NAME_HOUSING_TYPE_Co-op apartment      0.000000
NAME_HOUSING_TYPE_Municipal apartment  0.000000
NAME_HOUSING_TYPE_Office apartment     0.000000
NAME_HOUSING_TYPE_Rented apartment     0.000000
SK_ID_CURR                             0.000000

[251 rows x 1 columns]
Decisive factors for whether an applicant will default:
The defaulter data shows no clear boundary that separates defaulters from non-defaulters, so I instead summarize the ten characteristics that appear most frequently among loan defaulters:
NAME_CONTRACT_TYPE_Cash loans,
NAME_HOUSING_TYPE_House / apartment,
NAME_TYPE_SUITE_Unaccompanied,
NAME_EDUCATION_TYPE_Secondary/secondary special,
REGION_RATING_CLIENT_W_CITY_2,
OWN_CAR_AGE_Unknown,
FLAG_OWN_CAR_N,
FLAG_OWN_REALTY_Y,
CNT_CHILDREN_No_Child,
NAME_INCOME_TYPE_Working.
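A frequency summary like the list above can be produced by averaging the one-hot feature columns over the defaulter rows. A toy sketch with a hypothetical frame (the column names and `TARGET` flag follow the application data's conventions, but the values here are made up):

```python
import pandas as pd

# Hypothetical toy frame: one-hot applicant features plus TARGET (1 = defaulter)
df = pd.DataFrame({
    'NAME_CONTRACT_TYPE_Cash loans': [1, 1, 0, 1, 1],
    'FLAG_OWN_CAR_N':                [1, 0, 1, 1, 0],
    'NAME_INCOME_TYPE_Working':      [1, 1, 1, 0, 1],
    'TARGET':                        [1, 1, 0, 1, 0],
})

# Restrict to defaulters, then take the share exhibiting each characteristic
defaulters = df[df['TARGET'] == 1].drop(columns='TARGET')
top_traits = defaulters.mean().sort_values(ascending=False)
print(top_traits)
```

Taking the head of `top_traits` (the ten largest shares, on the real data) yields a ranking like the one listed above.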
To predict whether a client will default, the random forest with the best parameter setting gives a reasonably good reference, according to its `roc_auc_score`.